Autoencoder and Active Learning to Reduce False Positive Warnings in a Slippery Road Alert System
Using unlabeled multivariate time series

Master's thesis in Computer Science and Engineering

Philip Axenhamn
Andreas Greppe

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2023

© Philip Axenhamn, 2023.
© Andreas Greppe, 2023.

Supervisor: Selpi, Department of Computer Science and Engineering
Advisor: Per-Olof Nilsson, Volvo Cars
Examiner: Peter Damaschke, Department of Computer Science and Engineering

Master's Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023

Abstract

Modern vehicles are commonly equipped with a slippery road alert (SRA) system that warns drivers of slippery roads. Current implementations of this system occasionally produce warnings when the road is not slippery. These warnings are called false positives and can harm the system's trustworthiness. In this thesis, we propose a false positive filter capable of reducing the number of false positive alerts generated by an SRA system based on two machine learning techniques: autoencoder and active learning. The available data was completely unlabeled and contained few informative features.
The analysis of this data showed that vehicles send bursts of data points forming sequences. Due to the limitations of the data, multivariate time series were constructed with the idea that sequential data might reveal more about the situation than a single-point measurement. Furthermore, the sequences were grouped into true- and false-positive classes based on assumptions about the causes of the alerts, such as driving on ice or over a speed bump. The sequential data was used to train GRU- and LSTM-based autoencoders and classifiers to detect sequences that correspond to false positive situations so that they can be removed. The hyperparameters of the models were tuned using Optuna, and the best-performing models were further evaluated. Since the data was not labeled, the actual performance of the proposed solution could not be assessed. Instead, the evaluation was based on computing the proportion of remaining assumed true positive (ATP) and assumed false positive (AFP) sequences after filtering. The results show that the LSTM autoencoder could find patterns in the sequential data and was able to remove 43% of the AFP sequences while retaining 90% of the ATP sequences. The active learning approach proved not to work well with the available data.

Keywords: slippery road alert, false positive, machine learning, unsupervised learning, semi-supervised learning, autoencoder, active learning, anomaly detection

Acknowledgements

We would like to express our gratitude to everyone who has supported us during the project. We want to thank our supervisor Selpi from Chalmers, who gave us valuable suggestions and helped us form ideas. We also thank the people at Volvo Cars who made the project possible. A special thanks to Fredrik and Per-Olof, who provided feedback and help throughout the project.
Philip Axenhamn & Andreas Greppe
Gothenburg, 2023-06-19

Contents

List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Problem Formulation
  1.2 Aim and Objectives
  1.3 Contributions
  1.4 Limitations
  1.5 Thesis Outline

2 Related Work

3 Theory
  3.1 Friction Between Tire and Road Surface
  3.2 Estimating The Coefficient of Friction
    3.2.1 Vehicle Operating Parameters
    3.2.2 Tire Properties
    3.2.3 Road Characteristics
    3.2.4 Environmental Factors
  3.3 Machine Learning
    3.3.1 Unsupervised Learning
    3.3.2 Semi-supervised Learning
    3.3.3 Metrics
    3.3.4 Hyperparameter Tuning
  3.4 DBSCAN
  3.5 Artificial Neural Network
    3.5.1 Recurrent Neural Network
    3.5.2 Long Short-Term Memory
    3.5.3 Gated Recurrent Unit
    3.5.4 Autoencoder
  3.6 Spatial Hashing
4 Data Description and Analysis
  4.1 Available Data
    4.1.1 Vehicle Data
    4.1.2 Slippery Road Alert Data
    4.1.3 Weather Data
    4.1.4 Road Data
  4.2 Data Analysis

5 Methods
  5.1 Defining True- and False Positive SRAs
  5.2 False Positive Filter
  5.3 Data Preprocessing
    5.3.1 Sequence Generation
    5.3.2 Sequence Mapping
    5.3.3 Sequence Grouping
  5.4 Final Data Sets
    5.4.1 Autoencoder Data Set
    5.4.2 Active Learning Data Set
  5.5 Machine Learning Methods
    5.5.1 Autoencoder
    5.5.2 Active Learning
  5.6 Evaluation

6 Results
  6.1 Autoencoder
  6.2 Active Learning

7 Discussion and Conclusion
  7.1 Autoencoder
  7.2 Active Learning
  7.3 Available Data
  7.4 Evaluation
  7.5 Ethical and Legal Considerations
  7.6 Future Work
  7.7 Final Conclusion

Bibliography

A Appendix
  A.1 Autoencoder Experiment 3 - All trials
  A.2 Autoencoder Experiment 3 - Hyperparameter importance

List of Figures

3.1 Simplified visualization of the forces acting on a rotating wheel.
3.2 Pool-based active learning.
3.3 How DBSCAN works.
3.4 A single layer perceptron with three input variables and one output.
3.5 A multi-layer perceptron.
3.6 Three common activation functions.
3.7 The unrolling of a recurrent neural network.
3.8 Long short-term memory (LSTM) cell.
3.9 Gated recurrent unit (GRU) cell.
3.10 The structure of an autoencoder.
3.11 Spatial hashing.
4.1 Sample of data points from the vehicle data plotted on a map.
4.2 Number of generated alerts for each season within a year.
4.3 Number of generated slippery road alerts in Gothenburg during the project (late winter and early spring).
4.4 Number of daily generated alerts together with the daily air temperature and precipitation during summer 2022 in Gothenburg.
4.5 Heatmap of generated slippery road alerts within an area in Gothenburg during summer 2022.
4.6 Number of friction measurements against the daily air temperature and daily precipitation levels.
5.1 High-level overview of the proposed false positive filter.
5.2 Temporal clustering of the raw vehicle data.
5.3 Sequence to speed bump matching.
5.4 Assumed outcomes of different weather conditions.
5.5 Structure of the GRU and LSTM autoencoders.
5.6 Baseline model for the active learning approach.
5.7 Final model for the active learning approach.
6.1 Mean MSE reconstruction losses of each class using the GRU- and LSTM autoencoders for settings 1 and 4.
6.2 Train- and validation MSE losses for settings 1 and 4 using the GRU- and LSTM autoencoders.
6.3 Distributions of reconstruction losses per sequence of each class using the GRU- and LSTM autoencoders for settings 1 and 4.
6.4 Distributions of the reconstruction losses for each class using MSE loss and scaled MSE loss.
6.5 Train and validation loss for each epoch using regular MSE loss and scaled MSE loss.
6.6 Validation loss and loss difference for each trial with the LSTM autoencoder using MAE as loss function.
6.7 Validation loss and loss difference for each trial with the LSTM autoencoder using MSE as loss function.
6.8 Distributions of reconstruction losses for each class and for each model trained with hyperparameters from the selected trials.
6.9 Train- and validation loss during training of models with hyperparameters from the selected trials.
6.10 Distributions of the reconstruction losses for each class and two different thresholds.
6.11 Proportion of remaining sequences after filtering using both thresholds.
6.12 Active learning iteration for least certain sampling, test 7.
6.13 Active learning iteration for margin sampling, test 8.
A.1 The final validation loss and loss difference between ATP-ICE val. and AFP-DRY for both the GRU- and LSTM autoencoders using MAE and MSE as loss function.
A.2 Hyperparameter importance for each model tuned using Optuna.

List of Tables

3.1 Factors affecting the estimated friction between tire and road surface.
4.1 Features of the vehicle data.
4.2 Features of the slippery road alert data.
4.3 Features of the weather data from Trafikverket.
4.4 Features of the weather data from SMHI.
4.5 Features of the road data.
5.1 Definition of true- and false positive slippery road alerts.
5.2 Classes for the autoencoder data set.
5.3 Certain classes for the active learning data set.
5.4 Uncertain true positive classes for the active learning data set.
5.5 Uncertain false positive classes for the active learning data set.
5.6 Settings for autoencoder experiment 1.
5.7 Hyperparameters used in autoencoder experiment 1.
5.8 Hyperparameters used in autoencoder experiment 2.
5.9 Hyperparameters tuned with Optuna for the autoencoders.
5.10 Initial hyperparameters for the classifiers.
5.11 Settings for active learning experiment 1.
5.12 Settings for active learning experiment 2.
5.13 Hyperparameters tuned with Optuna for the classifier model.
5.14 Settings for active learning experiment 3.
6.1 Results from experiment 1 - GRU autoencoder.
6.2 Results from experiment 1 - LSTM autoencoder.
6.3 Autoencoder - Best hyperparameters.
6.4 Proportion of remaining sequences for each class after filtering using threshold 1.
6.5 Proportion of remaining sequences for each class after filtering using threshold 2.
6.6 Results from active learning experiment 1.
6.7 Results from active learning experiment 2.
6.8 Active learning - Best hyperparameters.
6.9 Results from active learning experiment 3.
6.10 Proportion of remaining sequences for each class after filtering using the classifier models.

List of Acronyms

ABS Anti-lock Braking System.
ADAS Advanced Driver Assistance System.
AE Autoencoder.
AFP Assumed False Positive.
AL Active Learning.
ANN Artificial Neural Network.
ATP Assumed True Positive.
DBSCAN Density-Based Spatial Clustering of Applications with Noise.
GRU Gated Recurrent Unit.
LSTM Long Short-Term Memory.
MAE Mean Absolute Error.
ML Machine Learning.
MSE Mean Squared Error.
RNN Recurrent Neural Network.
SRA Slippery Road Alert.
TCS Traction Control System.

1 Introduction

Slippery road surfaces are a common cause of car crashes in northern countries. According to Trafa (the Swedish government agency for traffic analysis), 33% of all road-traffic accidents reported during 2022 in Sweden occurred on wet or icy roads [1]. To increase vehicle and traffic safety, modern vehicles are equipped with Advanced Driver Assistance Systems (ADAS) that aim to assist drivers in different driving situations, including difficult road conditions. A subsystem of ADAS is the slippery road alert (SRA) system, which warns drivers of nearby slippery roads via a cooperative intelligent transport system (C-ITS), in which data, including information about road slipperiness, is anonymously shared with nearby vehicles through a cloud-based solution. These warnings make the driver aware of a potentially slippery road ahead so that the vehicle can be manoeuvred to avoid slipping. Inaccurate estimations of road slipperiness in this system can lead to either missed slippery roads (false negatives) or false warnings (false positives). Hence, generating accurate slippery road alerts is important both for vehicle and traffic safety and for the reliability of the system.

1.1 Problem Formulation

In an SRA system, vehicles share friction data with the cloud, where it is processed to generate SRAs that are sent to other nearby vehicles. Current implementations of SRA systems rely on the aggregation of friction coefficients between tire and road surface estimated from in-vehicle sensor measurements. In [2], the authors proposed a sensor data fusion approach for friction estimation based on the brush model and a Kalman filter. Furthermore, a more accurate friction estimation can be made using systems such as the anti-lock braking system (ABS) and the traction control system (TCS). However, current implementations of SRA systems occasionally produce false positive alerts.
Hence, drivers receive warnings of slippery roads when the road is not slippery. Reducing false positive SRAs is important since it would increase the reliability and trustworthiness of the system, which in turn can lead to drivers reducing the speed of the vehicle instead of ignoring the warnings.

1.2 Aim and Objectives

This thesis aims to reduce the number of false positive alerts generated by an SRA system using two machine learning techniques: autoencoder and active learning. The proposed solution is a false positive filter intended to remove sequences that correspond to false positive situations, e.g. driving over a speed bump. The primary objectives of this thesis can be divided into two parts. First, the causes of potential false positive alerts are investigated by analyzing the available data, in order to better understand in which situations the assumed false positive alerts commonly occur and to motivate decisions taken in the study. Second, we aim to improve an SRA system by reducing the number of false positive alerts through the use of the previously mentioned machine learning approaches.

1.3 Contributions

Numerous studies have investigated techniques for estimating the friction between tire and road surface. However, reducing false positive SRAs with machine learning models trained solely on unlabeled, estimated friction data in combination with weather and road data had not been done before this study. In this thesis, we use two machine learning approaches, based on an autoencoder and on active learning, to reduce the number of false positive SRAs. Both approaches are evaluated using unlabeled data only. Furthermore, the study provides an analysis of potential causes of false positive alerts in an SRA system based on the available data. Finally, an approach for creating sequential data from spatio-temporal data points based on time series clustering and DBSCAN is proposed.
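The full sequence-generation procedure is described in Chapter 5. As a rough, simplified sketch of the underlying idea, grouping time-stamped measurements into bursts can be done with a one-dimensional, density-style rule on the time axis; the 30-second gap threshold and the timestamps below are hypothetical illustrations, not values from this study:

```python
# Rough sketch: group time-stamped data points into sequences by
# placing points within `max_gap` seconds of each other in the same
# burst, loosely mimicking density-based clustering in one dimension.
# The gap threshold and the timestamps are hypothetical examples.

def group_into_sequences(timestamps, max_gap=30.0):
    """Split sorted timestamps (seconds) into bursts separated by > max_gap."""
    sequences = []
    current = [timestamps[0]]
    for t in timestamps[1:]:
        if t - current[-1] <= max_gap:
            current.append(t)
        else:
            sequences.append(current)
            current = [t]
    sequences.append(current)
    return sequences

# Two bursts of measurements separated by a long pause.
ts = [0, 5, 11, 18, 300, 306, 315]
print([len(s) for s in group_into_sequences(ts)])  # [4, 3]
```

A time-gap rule like this is the 1-D special case of what DBSCAN does in higher dimensions; the actual method in Chapter 5 also takes the spatial positions of the points into account.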
1.4 Limitations

The limitations of this study were mostly related to the available data and the evaluation of the proposed solution.

• Spatial and temporal limitations in the data: A data set was not available at hand and therefore had to be constructed during the project. The data sets used for training the machine learning models were gathered within a limited time frame, more specifically during winter and early spring. Hence, the generated data sets lack data from summer and autumn, where false positive alerts are probably easier to distinguish from true positive alerts. Additionally, the data was gathered from vehicles in Gothenburg, Sweden.

• Feature limitations in the data: The most important features in the available data were the estimated friction coefficient and the friction quality. Properties of the vehicle, such as its velocity, weight, and the lateral and longitudinal forces on the wheels, were not available but would probably be beneficial for detecting false positive alerts.

• Unlabeled data: The available friction data was unlabeled, meaning that ground truths for true and false positives were not available. Creating a labeled data set would be a time-consuming process, and due to the limited time for the project, this was avoided. Hence, unsupervised and semi-supervised methods were used, and some of the choices made in this study were based on assumptions motivated by the literature study and the data analysis.

• Evaluation limitations: Evaluating unsupervised machine learning models is generally challenging since there is no ground truth to compare with. Usually, at least a small set of labeled data is collected to have something to evaluate against. However, a labeled data set could not be created due to practical difficulties and time constraints. Hence, determining whether the proposed solution reduces false positive alerts was not feasible.
Instead, the evaluation of the proposed solution was mostly based on comparing the machine learning models on assumed true- and false-positive sequences using metrics that do not rely on ground truths.

1.5 Thesis Outline

The rest of this thesis is structured as follows:

• Chapter 2 presents previous work and research related to this study.
• Chapter 3 provides a theoretical background on the most important concepts and techniques used in this project.
• Chapter 4 provides a description and analysis of the available data.
• Chapter 5 describes the methods used and how the project was conducted, including data processing, choice of architecture, development of the machine learning models, and finally the evaluation of the proposed solution.
• Chapter 6 presents the results of the experiments and the final evaluation of the autoencoder and active learning approaches.
• Chapter 7 discusses the results and concludes the thesis with a discussion of ethical and legal considerations, insights on the topic, and future work.

2 Related Work

False positive alerts can be considered anomalies, and numerous studies have been made on how to detect anomalies in time series. In [3], the authors discussed common complexities and challenges in the field of anomaly detection using deep learning techniques. First, some anomalies might be unknown until they occur, which makes them difficult to prevent. Second, there might be heterogeneous anomaly classes, i.e. anomalies with different root causes and abnormal characteristics. Third, anomalies are typically rare and only occur on special occasions, which can lead to class imbalance. In [4], the authors pointed out that anomaly detection differs from traditional machine learning, where the goal is to find similarities rather than dissimilarities in the data.
Furthermore, the authors mentioned that the evaluation of anomaly detection systems is crucial but difficult to perform, and even claimed that it is more challenging than developing the system itself. One of the most challenging aspects of evaluating anomaly detection systems is the lack of a labeled data set. In other fields of machine learning, there are often public data sets available that make results directly comparable with those of previous studies. In contrast, researchers in the field of anomaly detection are forced to create their own data sets, which means that their results are not directly comparable with results from previous studies. For the reasons previously mentioned, anomaly detection is typically defined as an unsupervised learning task. In [5], k-means clustering was used to reduce anomalies in the field of computer networks. Using this approach, it was possible to achieve an anomaly detection rate of 82.92%. Clustering-based methods are typically less complex than other machine learning models and can therefore be easier to implement. However, other methods might be more suitable than clustering depending on the available data. Autoencoders are commonly used as an unsupervised method for anomaly detection and can be applied to many different types of complex data, including images and time series. In [6], autoencoders were applied to reduce false positives in a handgun detection system. The results showed that the autoencoder could reduce the number of false positives of detected guns by 78%. In [7], the authors trained LSTM-based autoencoders on time series for anomaly detection. The LSTM autoencoders achieved an accuracy of 87% in detecting anomalies. A more recent approach for anomaly detection in time series involves the utilization of transformers, as proposed in [8].
The experimental results revealed an improvement in F1-score of up to 17% and a remarkable 99% decrease in training time compared to the baseline models. While transformer-based autoencoders share similarities with traditional RNN-based autoencoders, they offer advantages such as improved performance and execution time. However, these benefits come at the expense of a more intricate architecture, which may pose implementation challenges. Moreover, research efforts have also been directed toward the prediction of road states using various machine learning techniques. Multiple studies have explored approaches utilizing data from external sensors such as GPS receivers, accelerometers, and gyroscopes attached to vehicles. In [9], the authors employed unsupervised methods, including self-organizing maps (SOM) and k-means clustering, in combination with the aforementioned sensor data to group roads based on their roughness and condition. Conversely, in [10] the authors implemented a supervised learning approach to predict road obstacles using the same type of sensor data. These studies show the potential of using sensor data to directly estimate and identify road obstacles and conditions from data available in the vehicles.

3 Theory

This chapter provides a theoretical framework for the key concepts of the project. Section 3.1 describes the fundamentals of friction and how it occurs between tire and road surface. Section 3.2 describes the most critical factors affecting the estimated friction coefficient. Section 3.3 describes important machine learning concepts relevant to the study. Section 3.4 describes the DBSCAN clustering algorithm used in the sequence generation process. Section 3.5 provides a theoretical background on artificial neural networks, recurrent neural networks, LSTM, GRU, and autoencoders. Finally, Section 3.6 describes spatial hashing, a method that was used in the creation of the data sets.
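The friction definition given in Section 3.1 below, µ = F / F_N, lends itself to a small numeric illustration. The sketch assumes a wheel on a flat horizontal surface, where the normal force is the carried mass times the gravitational acceleration; all numbers are hypothetical examples, not measurements from this study:

```python
# Illustrative sketch of the coefficient of friction, mu = F / F_N,
# for a wheel on a flat horizontal surface. All values are
# hypothetical examples.

G = 9.81  # gravitational acceleration (m/s^2)

def friction_coefficient(friction_force_n: float, mass_kg: float) -> float:
    """Estimate the coefficient of friction, where F_N = m * g."""
    normal_force_n = mass_kg * G
    return friction_force_n / normal_force_n

# Example: a wheel carrying 400 kg subject to a 2000 N friction force.
mu = friction_coefficient(2000.0, 400.0)
print(round(mu, 3))  # roughly 0.51
```

The point of the sketch is only the ratio itself: a higher friction force at the same load yields a higher µ, bounded in practice between 0 (no friction) and 1 (full friction).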
3.1 Friction Between Tire and Road Surface

Friction is the force that counteracts the relative motion between two surfaces that are in contact with each other. The coefficient of friction (COF), often written as µ, is defined as the ratio between the impressed force and the normal force acting on an object and ranges from 0 (no friction) to 1 (full friction). For a moving object, e.g. a rotating wheel on a flat horizontal surface, the coefficient of friction is defined as the quotient between the horizontal friction force (commonly called rolling resistance) F and the vertical normal force F_N [11]:

µ = F / F_N    (3.1)

As shown in Figure 3.1, the friction force F from the ground acts in the direction opposite to the direction of motion of the wheel at the point of contact. On the vertical axis of the wheel, the normal force F_N, which is perpendicular to the road surface, is the force with which the ground acts on the wheel. Conversely, according to Newton's third law of motion, the wheel acts with an equal force F_W on the ground, calculated by multiplying the mass by the acceleration due to gravity [11]. The friction force F between tire and road surface enables the vehicle to move forward and is essential for maneuvering the vehicle in both the lateral and longitudinal directions. In general, higher friction means more control over the vehicle. When the coefficient of friction is low, the vehicle starts to slip and the driver loses control of the vehicle. Hence, accurate estimation of the friction between tire and road surface is important for improving road and traffic safety.

Figure 3.1: Simplified visualization of the forces acting on a rotating wheel.

3.2 Estimating The Coefficient of Friction

The coefficient of friction (COF) can be estimated for each wheel or as the average tire-road friction coefficient over all wheels of the vehicle.
In the latter case, the coefficient of friction is assumed to be equal for all wheels; for individual tire-road friction coefficient estimation, this assumption is not made [12]. In this thesis, the coefficient of friction between tire and road surface was estimated as the average friction over all wheels. Average friction coefficient estimation is accomplished using slip-slope methods, extended Kalman filters, or lateral-dynamics-based methods [12].

Slip-slope methods estimate the COF between tire and road surface by utilizing the relationship between the normalized longitudinal forces acting on the wheels and the relative motion between tire and road surface, referred to as slip [12]. Tire slip has been proven to correlate strongly with the COF and can be described as the amount of wheel lock, where 0% slip corresponds to a free-rolling wheel and 100% corresponds to a fully locked wheel [13]. The coefficient of friction increases rapidly until the peak friction is reached, which typically occurs at 10-20% slip. Then, the coefficient of friction decreases slowly up to 100% slip, i.e. when the wheel stops rotating. To avoid sliding when braking on icy roads, the relationship between the coefficient of friction and slip is used in the anti-lock braking system (ABS) to determine when it needs to activate [13].

The interaction between tire and road surface is a complex event that is affected by a variety of parameters, including vehicle weight, slip, road condition, and surface texture. Additionally, surrounding environmental factors such as ice, snow, slush, water, and gravel have an impact on the friction that occurs between tire and road surface. Some of the most influential factors are divided into four categories presented in Table 3.1, modified from [13].

Table 3.1: Factors affecting the estimated friction between tire and road surface.
Vehicle operating parameters:
• Longitudinal motions: vehicle speed, braking action
• Lateral motions: turning, overtaking

Tire properties:
• Tread design and condition
• Inflation pressure
• Rubber composition and stiffness

Road characteristics:
• Road type: asphalt, brick, gravel
• Unevenness: potholes, speed bumps

Environmental factors:
• Weather: temperature, precipitation, humidity, wind speed
• Contaminants: anti-skid materials (salt, sand); dirt, mud, oil spills

3.2.1 Vehicle Operating Parameters

The estimated friction coefficient between tire and road surface depends on how the vehicle is maneuvered. As previously mentioned, the friction coefficient is estimated from the forces acting on the wheels. Since driving maneuvers such as accelerating, braking, and turning affect these forces, the estimated friction coefficient inherently depends on how the vehicle is maneuvered [13].

3.2.2 Tire Properties

When the wheel spins, the tire is deformed and recovered in repeated cycles due to the loaded weight. Hence, the friction force between the tire and the road surface has a hysteresis characteristic that depends on the stiffness of the tire rubber [13]. The tire stores deformation energy within the rubber and releases it in the form of a pushing force and heat, resulting in a net friction force that contributes to the reduction in forward motion. Additionally, friction occurs due to adhesion, which is the result of small-scale bindings of the rubber molecules with the road surface. These molecular bonds, called Van der Waals bonds, are weak attractive forces lasting for a short amount of time.
Thus, the total friction force F can be defined as the sum of the hysteresis and adhesion friction forces, F_H and F_A respectively, as follows [13]:

F = F_H + F_A (3.2)

Both the hysteresis and adhesion force components depend on the properties of the tire itself but also on the characteristics of the road surface that are described in the next section. The hysteresis effect mostly occurs on macro-level asperities and is the dominant component on wet and rough-textured road surfaces. Conversely, adhesion occurs on micro-level asperities and is dominant on dry and smooth-textured road surfaces [13].

3.2.3 Road Characteristics

The properties of the road surface have a direct effect on the estimated tire-road friction coefficient. The type of road surface affects the maximal friction achievable between the road and the tire. In [14], the authors studied how different road conditions affect the friction coefficient. On rough roads, the friction coefficient was measured to be in the range 0.8-0.9, whereas smooth roads resulted in the range 0.35-0.45. Finally, gravel and mud surfaces resulted in friction coefficients between 0.1 and 0.3. Hence, the type of road surface influences the friction coefficient regardless of other parameters.

3.2.4 Environmental Factors

Both temperature and precipitation have an indirect effect on the friction between tire and road surface. Water on the road reduces the friction between tire and road, resulting in reduced traction and control over the car. In [15], it was shown that even a 0.05 mm layer of water on the road surface can result in up to 30% less tire-road friction compared to dry conditions. More severe cases, when thicker layers of water build up on the road surface, result in an increased risk of hydroplaning, where the friction coefficient can drop close to zero [13]. Snow and ice are the most hazardous environmental factors when it comes to driving.
In the experiments of [14], the friction coefficient of ice surfaces was below 0.1. In cases when ice or snow blocks the direct contact between tire and road surface, any sudden changes like braking or changing direction will cause the wheels to lock and slide [13].

3.3 Machine Learning

Machine learning (ML) is a sub-field of artificial intelligence (AI) in which algorithms are used to find patterns in empirical data, inspired by how humans learn [16]. The concept of learning in ML is accomplished by gradually improving the accuracy of predictions in an iterative process commonly referred to as training. The end product of a machine learning pipeline is a trained model capable of making predictions based on new data that the model has never been exposed to before [16]. ML has proven to perform well in a variety of tasks such as classification, regression, clustering, and association tasks within many different fields including self-driving cars, banking and financial services, and healthcare. In recent decades, there have been tremendous advances in the field of artificial neural networks (ANNs), especially deep neural networks (DNNs), which have proven to outperform previously state-of-the-art methods in many fields [17]. Furthermore, depending on both the available data and the task at hand, machine learning can be divided into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [18]. In this study, unsupervised and semi-supervised learning methods were used and are therefore described more thoroughly next.

3.3.1 Unsupervised Learning

Unsupervised learning is a machine learning technique in which algorithms are trained to find patterns in unlabeled data [18]. In contrast to supervised learning, where each data point is tagged with a target variable, unsupervised methods rely solely on capturing patterns in the underlying structure of the data [18].
Labeling a large amount of data can be both time-consuming and expensive but is often necessary to train models capable of making accurate predictions that reflect real-world situations. Additionally, evaluating the results of models trained in an unsupervised fashion can be challenging since there is no ground truth to compare with, especially in classification and regression tasks [19]. However, unsupervised learning is still used in a variety of tasks including data visualization, association, anomaly detection, dimensionality reduction, and data generation. Data clustering methods such as k-means, DBSCAN, and hierarchical clustering are common unsupervised learning techniques in which data points are clustered into groups based on some predefined similarity measure such as Euclidean distance, Haversine distance, or cosine similarity [20]. For anomaly detection tasks, ANNs are commonly trained to find patterns in "normal data" such that anomalous data can be detected and removed [21]. An autoencoder is a type of generative ANN that was used in this study for anomaly detection and is therefore described later in this chapter.

3.3.2 Semi-supervised Learning

Semi-supervised learning methods use a combination of both labeled and unlabeled data, where the former is commonly smaller in size [18]. Examples of semi-supervised learning techniques are active learning, self-training, and co-training. Each of these techniques involves the model labeling or pseudo-labeling samples from a larger unlabeled data set. Hence, a smaller amount of labeled data is needed, which saves both time and resources. However, unlike in unsupervised learning, these models can still be trained in a supervised fashion using labels as ground truths to compute the loss.

Active Learning

Active learning (AL) is a semi-supervised machine learning approach used to create labeled data sets efficiently when annotating data is either difficult or expensive.
There are two methods for implementing AL: pool-based and stream-based. In stream-based AL, a model is trained on an initial small labeled data set, and new samples from the unlabeled data set are then passed successively to the model, which predicts a label. For predictions made with high confidence, the samples are added to the labeled data set, whereas the more uncertain samples are instead labeled by the oracle (a human labeler). Pool-based active learning involves additional steps, including sampling data points and using a query to select new samples, as shown in Figure 3.2. The core concept in pool-based AL is that a smaller labeled data set is used to train a model that samples data from the larger unlabeled data set using a query function. The new data samples are labeled using either the model or human annotators. Finally, the labeled samples are added to the labeled data set. The model is then trained using the extended data set, and the cycle is repeated until a data set of sufficient size has been achieved [22].

Figure 3.2: Pool-based active learning.

There are multiple methods for querying new samples in active learning, including query by committee, uncertainty-based sampling, and expected model change/error reduction. The committee sampling methods score the samples using multiple models trained on the labeled data. There are several uncertainty-based methods, but the main goal is to efficiently select the samples that the model struggles to classify accurately and manually label them instead. Some examples are margin-based sampling, entropy sampling, and least confidence sampling [23]. Finally, expected model change and expected error reduction are similar, computationally heavy methods. The change or error reduction is calculated when adding a new sample, leading to many additional computations [24].

3.3.3 Metrics

Metrics are used to measure the performance of machine learning models.
For classification tasks, some of the most widely used metrics are accuracy, F1-score, precision, and recall. Accuracy can be used to assess the overall performance of the model. For binary classification tasks, accuracy is defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (3.3)

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives. However, accuracy can be misleading if the data set is imbalanced, i.e. when there are more instances of some class. Additionally, accuracy is not a good measure of the model's performance if false positives and false negatives have different costs. Therefore, other metrics such as the F1-score are usually more useful than accuracy. The F1-score is computed from the precision and recall of a model. Precision measures the proportion of positive predictions that were correct and is computed as follows:

Precision = TP / (TP + FP) (3.4)

In contrast, recall measures the completeness of the positive predictions and is computed as follows:

Recall = TP / (TP + FN) (3.5)

Finally, with precision and recall, the F1-score is computed as follows:

F1-score = 2 · (Precision · Recall) / (Precision + Recall) (3.6)

3.3.4 Hyperparameter Tuning

Hyperparameter tuning is essential when optimizing the performance of a machine learning algorithm. A hyperparameter is a parameter that directly affects the learning process, such as the batch size, learning rate, and number of epochs. The optimal hyperparameters for a model differ depending on the data that the model is being trained on. The goal of hyperparameter optimization is to find a combination of hyperparameters that yields the optimal value of an objective function, e.g. loss or accuracy [25]. The most common approaches for hyperparameter tuning are grid search and random search. Grid search is an exhaustive search method in which the best-performing model is found by training the model on each combination of the provided hyperparameters.
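As a small illustration of exhaustive grid search, the hyperparameter combinations can be enumerated with `itertools.product`. The search space and the stand-in objective below are hypothetical; in practice, the objective would train and evaluate a model:

```python
from itertools import product

# Hypothetical search space (illustrative values, not the thesis's actual grid):
grid = {"learning_rate": [1e-3, 1e-4], "batch_size": [32, 64], "epochs": [10, 20]}

def validation_loss(params):
    # Placeholder objective; a real run would train a model and return its
    # validation loss for this hyperparameter combination.
    return params["learning_rate"] * params["batch_size"] / params["epochs"]

# Evaluate every combination and keep the best one:
best = min(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=validation_loss,
)
print(best)  # lowest stand-in loss: lr=1e-4, batch_size=32, epochs=20
```

The number of evaluations grows multiplicatively with each added hyperparameter (here 2 × 2 × 2 = 8 runs), which is what makes grid search expensive.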
In random search, the combinations of hyperparameters are randomized, which in some cases can lead to better hyperparameters in fewer iterations. Even though both grid and random search can often be computed in parallel, they are time-consuming methods, especially for large numbers of hyperparameters [25]. Another approach for hyperparameter tuning is to use a hyperparameter optimization framework such as Optuna, which is designed to find good parameters automatically and efficiently. A benefit of using Optuna is that the search space is constructed dynamically using a sampling strategy. Both grid search and random search can be used as the sampling method in Optuna. Additionally, other model-based sampling methods available in Optuna are the tree-structured Parzen estimator (TPE), Gaussian processes (GP), and the covariance matrix adaptation evolution strategy (CMA-ES). In this thesis, the TPE sampling method was used, which is based on independent sampling of previously evaluated hyperparameters and is known to generally perform well. Another benefit of using Optuna is that trials can be terminated early if they do not seem promising. This means that more time can be spent on better trials during the search for the optimal hyperparameters. Finally, after the hyperparameter tuning process has finished, Optuna can extract the hyperparameters that had the most impact on minimizing or maximizing the objective value [26]. Additionally, Optuna can handle multiple objectives during the hyperparameter tuning process. In multi-objective optimization, it is common to end up with more than one optimal combination of hyperparameters, and trade-offs must therefore be made. With Optuna, the optimal trials for multiple objectives can be extracted by computing the Pareto front [27].
3.4 DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm used to group spatial data points by how densely packed they are [28]. Points that are closely packed together will form a cluster, while points located in low-density regions will be classified as outliers. Hence, DBSCAN can detect outliers and handle noisy data well, unlike other well-known clustering algorithms such as k-means. The main idea of DBSCAN is to cluster the given data based on a concept called the ϵ-neighbourhood, defined as follows:

Definition 1. Let D be a data set containing points. A point q ∈ D is a neighbour of p ∈ D if its distance to p is smaller than the radius ϵ:

N_ϵ(p) = {q ∈ D | d(p, q) < ϵ}

where d is the distance measure. Examples of distance measures are the Euclidean, Manhattan, and Haversine distances. The latter is defined as the angular distance between two points (longitude and latitude) on the surface of a sphere. Unlike the Euclidean and Manhattan distances, the Haversine distance takes the curvature of the sphere into account. In DBSCAN, the radius ϵ determines the maximum distance between two points for one of them to be considered a neighbour of the other. Choosing a too small epsilon would lead to many points being classified as outliers. Conversely, a too high epsilon value would result in merging clusters that should have been separated. Another tunable parameter in DBSCAN is the minimum number of points within the radius ϵ needed for the center point to be considered a core point.

Figure 3.3: How DBSCAN works.

DBSCAN starts by defining core points, as shown in Figure 3.3. Core points are used to extend the clusters further. Ultimately, when all core points have been assigned a cluster, non-core points (border points) that are within the ϵ radius of a core point are added to its cluster. Border points can only be added to clusters and cannot be used to add new points to the clusters.
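The procedure just described — identifying core points, expanding clusters, attaching border points, and marking the rest as outliers — can be sketched in pure Python. This is an illustrative, naive O(n²) version; library implementations such as scikit-learn's are far more efficient:

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for outliers."""
    def neighbours(i):
        # Definition 1: all points with d(p, q) < eps (naive linear scan)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) < eps]

    labels = [None] * len(points)          # None = unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:           # i is not a core point
            labels[i] = -1                 # provisional outlier (may become border)
            continue
        cluster += 1                       # start a new cluster from this core point
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # previously-seen non-core: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours(j)) >= min_pts:   # j is a core point: keep expanding
                queue.extend(neighbours(j))
    return labels

# Illustrative data: two dense groups and one isolated point.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
       (10.0, 0.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```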
Finally, the remaining points are considered to be outliers since they are not close to any core point.

3.5 Artificial Neural Network

An artificial neural network (ANN) is a machine learning algorithm inspired by the biological neural networks that exist in the human brain [29]. Similar to the human brain, ANNs are built upon a set of interconnected nodes/neurons that send and receive signals in a complex system. In an ANN, these signals are real numbers that are processed by the nodes to form an output computed by a non-linear function, called an activation function. The simplest ANN is a single-layer perceptron consisting of multiple inputs and one output, as shown in Figure 3.4.

Figure 3.4: A single-layer perceptron with three input variables and one output.

A perceptron with n input variables is a type of linear classifier that computes the output ŷ in two steps. First, each input variable x_1, x_2, ..., x_n is multiplied with its corresponding weight w_1, w_2, ..., w_n, as shown in Equation 3.7:

z = Σ_{i=1}^{n} x_i w_i + b (3.7)

Additionally, a bias term b is added to this weighted sum and serves as an additional parameter that can be tuned to improve the performance of the perceptron. Second, the weighted sum is passed through an activation function f that determines whether the node should be activated, as shown in Equation 3.8:

ŷ = f(z) = 1 if z > 0, 0 otherwise (3.8)

Common activation functions are sigmoid, tanh, and the rectified linear unit (ReLU) [30]. In this simple example, the activation function activates if the input is positive and deactivates if the input is negative. Hence, the output of this perceptron will be binary (0 or 1). The single-layer perceptron can handle linear classification tasks but struggles when the data is not linearly separable [29]. Therefore, the multi-layer perceptron (MLP), also referred to as a feed-forward neural network (FNN), was developed with the idea of creating a non-linear model by combining multiple single-layer perceptrons in a network.

Figure 3.5: A multi-layer perceptron.

An MLP is a fully-connected neural network, meaning that each neuron in one layer is connected to each neuron in the neighbouring layers, as shown in Figure 3.5. A connection between two neurons enables the output signal from one neuron to be passed as input to the next neuron. The layers can be divided into an input layer, hidden layers, and an output layer. Both the input and output layers consist of a single layer each. Between them are the hidden layers, which can contain any number of layers and neurons. When calculating the depth of an ANN, only the layers with adjustable weights are counted, i.e. the hidden layers and the output layer. Increasing the number of hidden layers increases the number of parameters and the complexity of the network. Hence, training a network with many layers will be more time-consuming, and the network might overfit to the training data and thus perform worse when exposed to new data. Conversely, a network containing too few hidden layers will struggle to learn the underlying representation of the data. ANNs containing many layers are referred to as deep neural networks (DNNs) and will be discussed later in this chapter. The network shown in Figure 3.5 has four neurons in the input layer, one for each input. The input data is then passed to the two hidden layers containing three neurons each. Finally, the signals from the three neurons in the last hidden layer are passed to the output layer containing two neurons to form the output. Similar to humans, ANNs learn by making initial guesses and improving these guesses when new information is given and processed. The goal of training an ANN is that it should learn how to make accurate predictions by finding the optimal weights and biases for each neuron in the network.
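The two-step computation of Equations 3.7 and 3.8 can be sketched in a few lines (the input values and weights below are illustrative):

```python
def perceptron(x, w, b):
    """Single-layer perceptron: weighted sum plus step activation."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b  # Equation 3.7
    return 1 if z > 0 else 0                      # Equation 3.8 (step function)

# Illustrative inputs and weights:
print(perceptron([1.0, 2.0, 3.0], [0.5, -1.0, 0.2], 1.0))  # z = 0.1 > 0 -> 1
```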
Training an ANN is an iterative process that consists of two main steps: forward propagation and backpropagation. In forward propagation, the input data is passed forward through each neuron in the network. As previously mentioned, the output of each neuron is computed using an activation function that suppresses the input values to a bounded range such that the problem of exploding gradients is mitigated. The most commonly used activation functions are sigmoid, tanh, and ReLU, as shown in Figure 3.6.

Figure 3.6: Three common activation functions.

The sigmoid activation function is a non-linear function that transforms the input to a value between 0 and 1 and is thus suitable for producing probabilities [30]. Since many loss functions expect the input values to be in the range 0 to 1, the sigmoid activation function is commonly used for neurons in the output layer. The tanh function is another non-linear function that pushes the input values to the range -1 to 1. Both the sigmoid and tanh activation functions are differentiable everywhere, meaning that both can be used for gradient-based backpropagation. However, both functions can cause the vanishing gradient problem [30]. This means that the weights and biases are updated with small changes, resulting in slow learning. The ReLU activation function is commonly used in the hidden layers instead of sigmoid and tanh to fix this problem. Unlike sigmoid and tanh, ReLU is not differentiable at zero, which means that a sub-derivative is needed in the backpropagation when using ReLU as an activation function. Additionally, ReLU introduces sparsity in the hidden layers, which has been proven to be beneficial in ANNs and is achieved when negative values are converted to zero by the ReLU function. However, ReLU suffers from the dying ReLU problem that occurs when too many activations fall below zero, resulting in the deactivation of neurons and thus reduced learning [31].
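The three activation functions can be written directly from their definitions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # squashes input into (0, 1)

def tanh(z):
    return math.tanh(z)                # squashes input into (-1, 1)

def relu(z):
    return max(0.0, z)                 # zero for negative inputs (sparsity)
```

For example, `sigmoid(0.0)` is exactly 0.5, while `relu` zeroes out every negative activation, which is the source of both the sparsity benefit and the dying ReLU problem discussed above.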
The final output from the neurons in the output layer is a predicted vector ŷ that is compared with the target vector y using a loss function. The loss is a measure of how far the predicted value is from the target, and the choice of loss function depends on the type of task the network tries to solve, e.g. classification or regression. For binary classification tasks, the binary cross entropy (BCE) is commonly used and is computed as follows [32]:

L_BCE = -(1/n) Σ_{i=1}^{n} [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ] (3.9)

The BCE loss function expects the predicted values ŷ_i and target values y_i to be interpreted as probabilities ranging from 0 to 1. Additionally, the BCE loss is differentiable at all points, making it suitable for gradient computations. Another commonly used loss function for binary classification tasks is the hinge loss, defined as follows [33]:

L_HINGE = (1/n) Σ_{i=1}^{n} max(0, 1 − ŷ_i · y_i) (3.10)

Similarly to BCE, the hinge loss function also expects the predicted values to be between 0 and 1. However, the hinge loss is not differentiable at one, meaning that sub-derivatives are needed. For regression tasks, the mean absolute error (MAE), also known as L1 loss, is a commonly used loss function and is calculated by taking the absolute value of the difference between the predicted and target values as follows [34]:

L_MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (3.11)

Another loss function for regression tasks that is similar to MAE is the mean squared error (MSE), which is computed by taking the square of the difference between the predicted and target values [35]:

L_MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² (3.12)

Unlike MAE, MSE penalizes large errors since the difference is squared. If most predictions result in low losses, then MSE is preferable. However, if the model makes a few poor predictions, the MSE loss will magnify the error due to the squaring of the difference. This will result in an overall high error using MSE even though the majority of predictions were accurate.
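To make the MAE/MSE comparison concrete, both losses can be evaluated on a small set of illustrative predictions where a single prediction is poor:

```python
def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)    # Eq. 3.11

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)  # Eq. 3.12

y     = [1.0, 2.0, 3.0, 4.0]   # targets (illustrative values)
y_hat = [1.1, 1.9, 3.0, 9.0]   # three good predictions, one off by 5

print(mae(y, y_hat))  # ≈ 1.3
print(mse(y, y_hat))  # ≈ 6.255: squaring magnifies the single large error
```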
In that case, MAE might be the better alternative. The goal of training an ANN is to reduce the loss by updating the weights and biases of the network during backpropagation. Finding the smallest loss is an optimization problem in multi-dimensional space, tackled by the use of an optimizer. Most optimizers are based on gradient descent, which is an algorithm that converges to a local minimum of the loss function. Some of the most well-known optimizers are stochastic gradient descent (SGD), root mean squared propagation (RMSprop), and adaptive moment estimation (Adam) [36]. During backpropagation, the weights w_i and biases b of the network are updated using gradient descent as follows [37]:

w_i = w_i − η ∂L/∂w_i
b = b − η ∂L/∂b (3.13)

The learning rate η is a tunable hyperparameter that determines the size of each step taken towards a local minimum of the loss function in gradient descent [37]. A too high learning rate might result in overshooting the local minimum. Conversely, a too low learning rate can result in slow convergence or getting stuck at an undesirable local minimum. Commonly, the learning rate is set to a value between 1×10⁻³ and 1×10⁻⁵ but needs to be tuned for optimal performance.

3.5.1 Recurrent Neural Network

A recurrent neural network (RNN) is a type of artificial neural network (ANN) that handles sequential data as input [38]. Similar to vanilla ANNs, RNNs consist of an input layer, hidden layers, and an output layer. However, traditional ANNs do not take the order of the input data into account and can therefore not find temporal or structural patterns in sequences. Therefore, RNNs use an additional memory mechanism in the form of a feedback loop that allows the model to reuse past information in the computation of the next output [38], as shown in Figure 3.7.
RNNs can be used for different types of sequence modeling tasks including sentiment classification, text generation, image captioning, translation, and forecasting. For classification tasks, the output of the RNN will not be a sequence but rather a vector of probabilities. For sequence generation tasks, commonly referred to as sequence-to-sequence (seq-to-seq) tasks, the relationship between the input and output is many-to-many since both the input and output are sequences [39].

Figure 3.7: The unrolling of a recurrent neural network. The RNN is unrolled across time into a full network, one copy for each element in the sequence.

An RNN cell is similar to a neuron in a feed-forward neural network in that it computes an output vector by applying an activation function to the weighted sum of the inputs. However, in addition to the output vector, RNN cells produce a hidden state h_t that is passed to the next RNN cell in the network such that the output is influenced by both current and previous values. The weights and biases are shared across time, meaning that the number of parameters does not change with an increasing number of unrolls of the network. During forward propagation of the network, the hidden state h_t and output ŷ_t are updated for each time step t as follows [40]:

z_t = W h_{t−1} + U x_t + b
h_t = tanh(z_t)
o_t = V h_t + c
ŷ_t = softmax(o_t) (3.14)

where U is the input-to-hidden weight matrix, V the hidden-to-output weight matrix, W the hidden-to-hidden weight matrix, and b, c are bias vectors. During the backpropagation of an RNN, the gradients flow backward for each time step and across all time steps starting from time t. This process is referred to as backpropagation through time (BPTT) [38]. The gradients are computed with respect to each of the weight matrices U, V, W and the bias vectors b and c. Each parameter is then updated similarly to a regular ANN. Due to the unrolling of the RNN, the network usually becomes deep, resulting in many repeated gradient computations.
This can in turn result in vanishing or exploding gradients, i.e. gradients becoming too small or too large respectively [38]. Hence, the network will not be able to update the weights and biases properly, and the learning will be reduced or stopped entirely. In other words, vanilla RNNs suffer from short-term memory and struggle to carry information from earlier time steps, especially for long sequences. A solution to exploding gradients is gradient clipping, where large gradients are clipped back to a feasible scale. Vanishing gradients are a more common problem in RNNs and can be somewhat mitigated by choosing a suitable activation function, initializing the weights carefully, or changing the network architecture [38].

3.5.2 Long Short-Term Memory

As previously described, vanilla RNNs commonly face the vanishing and exploding gradient problems, which inherently prohibit the network from learning. To counteract this issue, long short-term memory (LSTM) cells were introduced with the idea of including both long- and short-term memory when computing the outputs of each cell in the RNN [41]. The long-term memory of an LSTM cell is represented by the cell state c that flows through the top of the LSTM cell in Figure 3.8. In contrast, the short-term memory is reflected in the hidden state h. The main idea of LSTMs is that the gradients for the cell states neither vanish nor explode since they are not computed directly with respect to the weights. LSTMs are more complex compared to regular RNNs but are better at learning the inter-dependencies in sequential data [41] through the use of gates. The first gate of an LSTM cell is the forget gate, where the input vector x_t at the current time step t and the previous hidden state h_{t−1} are passed through a sigmoid activation function that outputs a value ranging from 0 to 1 (a probability) as follows:

f_t = σ(W_f [h_{t−1}, x_t] + b_f) (3.15)

The first hidden state is usually initialized by setting each value to zero.
The output from the sigmoid activation function is then multiplied with the previous cell state c_{t−1}. Hence, the output of the sigmoid activation function determines the amount of long-term memory that should be forgotten [41].

Figure 3.8: Long short-term memory (LSTM) cell.

Next, in the input gate, the previous hidden state and the current input are passed to both a sigmoid and a hyperbolic tangent activation function as follows:

i_t = σ(W_i [h_{t−1}, x_t] + b_i)
c̃_t = tanh(W_c [h_{t−1}, x_t] + b_c) (3.16)

The latter is used to compute potential long-term memory, while the former decides the fraction of that long-term memory to be added to the cell state. The current cell state c_t is updated based on the outputs from both the forget gate and the input gate as follows [42]:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t (3.17)

The final gate in an LSTM cell is the output gate, where the values of the current cell state c_t are passed to a hyperbolic tangent function to compute potential short-term memory. Similar to the other gates, the sigmoid activation function is applied to the previous hidden state and the current input to compute a probability that determines the amount of short-term memory to be passed to the next LSTM cell, as follows [42]:

o_t = σ(W_o [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t) (3.18)

3.5.3 Gated Recurrent Unit

A gated recurrent unit (GRU) is a type of RNN cell that is similar to the LSTM in that gates are used to update the hidden states of the network at each time step. GRUs are less complex than LSTMs, which makes them less computationally expensive while still maintaining performance comparable to that of LSTMs. Unlike LSTMs, GRUs consist of only two gates and do not use an internal cell state, i.e. the long-term memory state. Hence, in theory, LSTM networks should outperform GRUs when the input sequences are long [43].

Figure 3.9: Gated recurrent unit (GRU) cell.
A GRU cell consists of a reset gate and an update gate, as shown in Figure 3.9. The input vector x_t and the previous hidden state h_{t−1} are first passed to the sigmoid function in the reset gate as follows [43]:

r_t = σ(W_r [h_{t−1}, x_t] + b_r) (3.19)

The sigmoid function outputs a value between 0 and 1. This value is used later in the update of the current hidden state h_t and determines how much of the previous information should be forgotten. Next, the input vector x_t and the previous hidden state h_{t−1} are passed to the sigmoid function in the update gate, where the output is computed as follows [43]:

z_t = σ(W_z [h_{t−1}, x_t] + b_z) (3.20)

This value is used to update the hidden state. The candidate hidden state h̃_t is computed using a hyperbolic tangent function: the output from the reset gate, r_t, is multiplied element-wise with the previous hidden state h_{t−1}, this product together with the input vector x_t is multiplied with the weight matrix, the bias term is added, and the result is passed to the hyperbolic tangent function as follows [43]:

h̃_t = tanh(W_h [r_t ⊙ h_{t−1}, x_t] + b_h) (3.21)

Furthermore, to compute the current hidden state h_t, the candidate hidden state is multiplied with the output from the sigmoid function in the update gate, z_t. The complement (1 − z_t) is multiplied with the previous hidden state h_{t−1}. The sum of these element-wise products forms the current hidden state as follows [43]:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t (3.22)

3.5.4 Autoencoder

An autoencoder (AE) is an unsupervised learning technique based on an artificial neural network that is trained to reconstruct its high-dimensional input data. Autoencoders can be used in a variety of applications including feature extraction, dimensionality reduction, image generation, and anomaly detection.
Using autoencoders for anomaly detection is a special use case where the reconstruction error, or loss, is used to classify and filter out anomalies [3]. Examples of applications where autoencoders are used for anomaly detection are [44], [7], [6]. The structure of an autoencoder consists of three main parts: the encoder, the bottleneck, and the decoder, as shown in Figure 3.10. The encoder can be viewed as a function g(x) that transforms the input data x into a latent vector representation z. The encoded vector z is located in the bottleneck of the network and is a compressed representation of the input data. This vector is then up-scaled to the same dimension as the input data in the decoder part f(z) of the network. Ideally, after training the autoencoder, the input x should be identical to the reconstructed output x̂ [45].

Figure 3.10: The structure of an autoencoder.

As previously mentioned, autoencoders can be used for anomaly detection tasks. The main idea is to train the autoencoder on a data set that contains a majority of normal data points. Ideally, this data set should not include any anomalies. However, this can be challenging to achieve, especially when the causes of the anomalies are unknown. The autoencoder encodes the normal data into a latent vector representation that is decoded back to its original dimension. The autoencoder is trained to minimize the reconstruction loss such that the reconstructed output will be similar to the input after training. Hence, smaller reconstruction losses will in general correspond to reconstructed outputs that are more similar to the input. After training, the autoencoder should have learned the latent distribution of the input data such that when anomalous data is given as input, the reconstruction loss will be higher than for normal data. A threshold on the reconstruction loss can therefore be used to classify whether the input data is normal or anomalous.
Autoencoders for anomaly detection tasks can be summarised in the following four steps:
1. Train the autoencoder on normal data.
2. Feed normal data as input to the trained autoencoder and compute the distribution of the reconstruction loss.
3. Select a threshold on the reconstruction loss, e.g. such that the majority of normal data is classified as normal.
4. Test and evaluate the trained autoencoder with the given threshold by observing the number of anomalies being detected.
One of the most important hyperparameters for autoencoders is the embedding size, which determines the size of the compressed output vector of the encoder. The embedding size can have a large impact on the model's ability to distinguish false positive from true positive sequences. Too large an embedding will allow the autoencoder to learn the identity function, so that the model generates low losses for any input, making it unreliable in detecting anomalies [46]. Conversely, too small an embedding size will make it too difficult for the model to reconstruct the data since too much information is lost. Traditionally in machine learning, the objective of hyperparameter tuning is to minimize the validation loss or maximize the accuracy. However, finding optimal hyperparameters for autoencoders in the setting of anomaly detection is a non-trivial task because the loss is not a rigorous measure of performance on its own. For example, increasing the embedding size will result in the input data not being compressed much, and the decoder will easily decode the latent vector representation and reconstruct the input data such that the reconstruction loss is small for all inputs. However, the autoencoder will then struggle to find and learn patterns specific to the data it is trained on and thus will not perform well in detecting anomalies [46].
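Steps 2–4 above can be sketched given a set of reconstruction losses computed on held-out normal data. This is a generic illustration, not the thesis's code; the 90% quantile is an arbitrary example of "the majority of normal data classified as normal".

```python
import numpy as np

def fit_threshold(normal_losses, quantile=0.90):
    """Steps 2-3: pick a loss threshold so that `quantile` of the
    normal data falls below it and is classified as normal."""
    return np.quantile(normal_losses, quantile)

def classify(losses, threshold):
    """Step 4: flag inputs whose reconstruction loss exceeds the
    threshold as anomalous (True = anomaly)."""
    return np.asarray(losses) > threshold
```

A typical usage would be `thr = fit_threshold(losses_on_normal_validation_data)` followed by `classify(losses_on_test_data, thr)`; the quantile then directly controls the fraction of normal data sacrificed as false alarms.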
3.6 Spatial Hashing Spatial hashing is a method used to reduce the number of comparisons needed to find neighbouring points [47]. Assume the goal is to match each point pi from the data set D1 = {p1, p2, . . . , pn} with points qj from the data set D2 = {q1, q2, . . . , qm} based on their spatial relationship. A naive solution would be to compare the locations of every pair of points. This results in a time complexity of O(nm), where n and m are the numbers of points in D1 and D2, respectively. Spatial hashing is a significantly more efficient solution, especially as the number of points in the data sets grows. The main idea of spatial hashing is to reduce the number of comparisons needed to find neighbouring points in two steps. First, each point pi ∈ D1 is assigned to a cell in a grid using a hash function, as shown in Figure 3.11. The hash function can, for example, map latitude and longitude coordinates to cell indices in 2D or 3D space [47]. Figure 3.11: Spatial hashing. Then, to find neighbouring points of a point qj ∈ D2, the point is hashed to a cell index using the same hash function. The distance is computed between qj and each point from D1 that was assigned to an adjacent cell, including the cell to which qj itself was assigned. Hence, instead of computing the distance between every pair of points as in the naive solution, qj is only compared with points in neighbouring cells. The cell size influences the number of points mapped to the same cell and is a parameter that needs to be tuned for optimal performance. Too large a cell size results in too many points hashing to the same cell, which increases the number of comparisons needed. Conversely, too small a cell size spreads points over many cells, so that nearby points may end up in non-adjacent cells and are never compared [47].
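The two-step scheme can be sketched for points in the plane. This is a simplified illustration assuming planar (x, y) coordinates and Euclidean distance, rather than the geographic coordinates used later in the thesis.

```python
from collections import defaultdict
from math import floor, hypot

def build_grid(points, cell_size):
    """Step 1: hash each (x, y) point from D1 into a grid cell."""
    grid = defaultdict(list)
    for p in points:
        key = (floor(p[0] / cell_size), floor(p[1] / cell_size))
        grid[key].append(p)
    return grid

def neighbours(q, grid, cell_size, radius):
    """Step 2: compare q only against points in its own cell and the
    8 adjacent cells, instead of against every point in D1."""
    cx, cy = floor(q[0] / cell_size), floor(q[1] / cell_size)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for p in grid.get((cx + dx, cy + dy), []):
                if hypot(p[0] - q[0], p[1] - q[1]) <= radius:
                    out.append(p)
    return out
```

Note that for the 3x3 neighbourhood to be sufficient, the cell size should be at least as large as the query radius; this is one reason the cell size needs tuning.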
The time complexity of spatial hashing depends on how the points are distributed among the cells. When all points hash to distinct cells, each neighbour lookup takes O(1) time. The worst case occurs when all points are assigned to the same cell, which results in a time complexity of O(nm), the same as the naive approach [47]. 4 Data Description and Analysis This chapter provides a description and analysis of the available data for this project. Section 4.1 presents the available data in this study, which originated from three sources: Volvo Cars Corporation (VCC), Trafikverket, and SMHI. This is followed by an analysis of the data in Section 4.2. Since the data was unlabeled, the analysis was essential to find potential root causes of false positive alerts and to justify choices made in the methodology. 4.1 Available Data The available data used in this study is divided into the following four sets: vehicle data, slippery road alert (SRA) data, weather data, and road data. The vehicle and SRA data originated from VCC, while the weather and road data were requested and gathered from the Swedish Transport Administration (Trafikverket). Additionally, the weather data used in the analysis was gathered from the Swedish Meteorological and Hydrological Institute (SMHI). 4.1.1 Vehicle Data The vehicle data was gathered from the cloud for 10 weeks (72 days) between the 12th of January and the 24th of March 2023, during winter and early spring in Gothenburg, Sweden. Ultimately, after the data gathering process, the vehicle data contained approximately 115 million data points. Due to GDPR, the vehicle data was anonymized such that unique vehicles could not be identified. The most important features of the vehicle data that were used in this thesis are presented in Table 4.1. This data contained bursts of data points sent from Volvo vehicles across Gothenburg and was transformed into sequences in the preprocessing step described in Chapter 5.
The sequential vehicle data was then used to train and evaluate the machine learning models.

Table 4.1: Features of the vehicle data.

  #  Feature                        Abbreviation  Data type
  1  Friction coefficient           FC            float
  2  Friction quality               FQ            int
  3  Creation time                  CT            timestamp
  4  Coordinate                     COORD         geolocation
  5  Temperature                    Tv            float
  6  Wiper speed                    WS            int
  7  Vehicle identification number  VIN           string
  8  Software versions              SV            string
  9  Road max speed                 RMS           int

1. Friction coefficient
The estimated friction coefficient was the main feature of interest in this thesis and is a value ranging from 0 to 1, where 0 represents no friction and 1 represents maximum friction. Friction coefficients less than or equal to 0.35 are defined as low friction, while friction coefficients above 0.35 are defined as high friction. This estimated value is based on the longitudinal brush model, the lateral tire-force model, and limit handling measurements based on the ABS and traction control system.

2. Friction quality
The friction quality is an integer ranging from 4 to 7 indicating the quality of the estimated friction coefficient. The calculation of this value is based on the variance of slip and multiple input signals to the vehicle, such as the lateral and longitudinal forces on the wheels.

3. Creation time
The creation time represents the time at which the friction data was created in the database and is expressed as a date and time with a resolution from hours down to milliseconds.

4. Coordinate
The coordinates in the friction data correspond to the longitude and latitude coordinates of each data point.

5. Temperature
The temperature measurement originates from the vehicle temperature sensor and is expressed in degrees Celsius.

6. Wiper speed
The wiper speed is an integer value ranging from 0 to 6 indicating the speed of the car wiper. This value correlates with the precipitation.
However, since the wiper speed can be set manually, it is neither an accurate nor a reliable indicator of precipitation.

7. Vehicle identification number
The vehicle identification number (VIN) is a unique identification number for each vehicle and was only used for creating the sequential data. This number was removed and was not present in the data sets due to GDPR.

8. Software versions
Software versions represent multiple versions of the software in the electronic control unit (ECU).

9. Road max speed
The maximum legal speed for the road, ranging from 5 km/h to 120 km/h.

Figure 4.1 shows a sample of data points from the vehicle data plotted on a map. Each data point corresponds to a row in the vehicle data with each of the previously mentioned features as columns. As can be seen in the upper right corner, vehicles in motion send bursts of data points to the cloud, resulting in the formation of sequences with both temporal and spatial properties. This was an important discovery that was exploited in the creation of the data sets for the machine learning models. Figure 4.1: Sample of data points from the vehicle data plotted on a map. 4.1.2 Slippery Road Alert Data The slippery road alert (SRA) data consisted of both the location and time of specific SRAs generated in Gothenburg, Sweden within a year, from the 25th of March 2022 to the 24th of March 2023. This data was only used in the analysis to find root causes of potential false positive SRAs. Furthermore, the SRA data had been aggregated, meaning that unique vehicles could not be identified in this data. The features of the SRA data are shown in Table 4.2. The start and end times are expressed as dates and times with a resolution from hours down to milliseconds. The locations of the SRAs are indicated by longitude and latitude coordinates. Table 4.2: Features of the slippery road alert data.
  #  Feature   Data type
  1  Start time  timestamp
  2  End time    timestamp
  3  Location    coordinate

4.1.3 Weather Data The weather data used to create the data sets was requested from Trafikverket and consisted of both temperature and precipitation measurements from 11 weather stations located near roads scattered across Gothenburg, Sweden. The features of this weather data are shown in Table 4.3. The station id is an integer value identifying each of the 11 weather stations. The road temperature was measured in degrees Celsius and the road precipitation level was measured in millimeters. The time interval of this data was 30 minutes. This data was used to filter the vehicle data to form the data sets used for training and testing the machine learning models. Additionally, this data was also used to train some of the models using the active learning approach.

Table 4.3: Features of the weather data from Trafikverket.

  #  Feature             Abbreviation  Data type
  1  Station id          SI            int
  2  Date and time       DT            timestamp
  3  Road temperature    Tr            float
  4  Road precipitation  Pr            float

The weather data used in the data analysis originated from SMHI. This data was used since the weather data from Trafikverket was not available at the time the analysis was conducted. Additionally, the weather data from SMHI was aggregated per day and computed using more sophisticated algorithms than simply taking the mean for each day, which would have been the alternative if the weather data from Trafikverket were used instead. The temperature was measured in degrees Celsius while the precipitation levels were measured in millimeters. The data from SMHI originated from a weather station in Gothenburg and was gathered via SMHI's public website. The features of the weather data from SMHI are presented in Table 4.4. Table 4.4: Features of the weather data from SMHI.
  #  Feature                Data type
  1  Date and time          timestamp
  2  Daily air temperature  float
  3  Daily precipitation    float

4.1.4 Road Data The road data was gathered from Trafikverket's site Lastkajen [48] within the area of Gothenburg. Lastkajen is a web-based application open to the public where one can order and download Sweden's road and railway data. For this thesis, the most valuable road data was the speed bump locations and the road types, as shown in Table 4.5. Each speed bump location consisted of a single longitude and latitude coordinate pair, whereas the road types were represented by polygons of multiple longitude and latitude coordinates.

Table 4.5: Features of the road data.

  #  Feature              Data type
  1  Speed bump location  coordinate
  2  Asphalt road         coordinates
  3  Gravel road          coordinates

4.2 Data Analysis An initial data analysis of the unprocessed data from VCC and Trafikverket was conducted for three main reasons. First, since the available data was unlabeled, this initial analysis was essential to get an intuition of the circumstances in which true and false positive warnings might occur. One hypothesis was that most of the alerts generated during a year are true positives and occur when the probability of ice formation on road surfaces is high, i.e. when precipitation levels are high and the temperature is below 0 ◦C. Conversely, some false positive alerts might occur when driving over speed bumps or when making a sharp turn, for example. Second, the insights from this analysis were used to motivate the choices made in this project, including how the raw data was processed and how the models were constructed. Third, this initial analysis might reveal correlations between features and limitations of the data that need to be taken into consideration in the creation of the final data sets used to train and test the machine learning models.
Ultimately, the goal of this initial data analysis was to answer the following questions:
• Are there seasonal patterns in the number of generated slippery road alerts?
• What are the root causes of true and false positive slippery road alerts?
• In which weather conditions are low and high values of the estimated friction coefficient sent to the cloud?
The majority (56%) of the alerts were generated during spring (March to May), as shown in Figure 4.2. This was somewhat unexpected since winter is more commonly associated with slippery roads. However, further investigation into the data revealed that the maximum number of alerts occurred during a week in spring when it snowed heavily and the temperature was below 0 ◦C. Furthermore, 34% of the alerts occurred during winter, while 7% occurred during autumn and only 3% during summer. Figure 4.2: Number of generated alerts for each season within a year. Most of the alerts generated during spring and winter are believed to be true positives caused by ice formation on the road surfaces. Hence, finding false positive alerts during these seasons can be challenging. On the contrary, most of the alerts generated during autumn and summer are probably false positives. However, both true and false positives can occur in any season. For example, some true positives during summer and autumn could be caused by hydroplaning. Conversely, during spring and winter, hard braking or driving over a speed bump can potentially cause false positive alerts. The vehicle data used to create the data sets for the machine learning models was gathered during late winter and early spring. Hence, the number of generated alerts from the SRA data during this period is of high interest. Figure 4.3 shows the number of generated alerts (orange) against the air temperature (grey) and precipitation levels (blue) for each day from 2023-01-12 (winter) to 2023-03-24 (early spring).
Figure 4.3: Number of generated slippery road alerts in Gothenburg during the project (late winter and early spring). As expected, when the daily air temperature dropped below 0 ◦C and the precipitation levels were above 0 millimeters, the number of alerts increased. Note that the daily air temperature dropped below 0 ◦C four times within this period. The first two drops occurred on the 22nd of January and the 27th of January. Even though the daily air temperature was below 0 ◦C, no or few alerts were generated. The likely reason is the absence of precipitation. The next drop in air temperature below 0 ◦C occurred on the 4th of February, when the precipitation levels were above 0 millimeters. As expected, this increased the number of generated alerts. Similarly, the number of generated alerts increased during the final drop in air temperature, which occurred around the 8th of March. The precipitation levels were above 0 millimeters during this period as well. The maximum number of alerts (approximately 130 000) occurred on the 7th of March. In general, when the daily air temperature was above 0 ◦C, few alerts were generated. Some of these alerts might be false positives. An increase in precipitation levels during those days did not seem to drastically increase the number of generated alerts. The number of daily generated alerts during summer 2022 in Gothenburg is shown in Figure 4.4. Figure 4.4: Number of daily generated alerts together with the daily air temperature and precipitation during summer 2022 in Gothenburg. As previously mentioned, the number of alerts generated during summer was much lower compared to the number generated during winter and spring. High precipitation levels and air temperatures above 0 ◦C increase the risk of hydroplaning, which could explain the increase in generated alerts on the 25th of July and the 4th of August.
Most of the alerts generated during days with no or a small amount of precipitation are probably false positives. There are many possible explanations for these false positives, including hard braking, rapid acceleration, or driving over speed bumps. To investigate this further, the locations of alerts generated during summer 2022 are visualized in Figure 4.5. The selected area shown in this figure is where the majority of the alerts were generated in Gothenburg during the summer of 2022. Figure 4.5: Heatmap of generated slippery road alerts within an area in Gothenburg during summer 2022. The vast majority of the generated alerts, marked in red, are close to speed bumps in this specific area. The reason why some of the false positive alerts are generated close to speed bumps could be hard braking or driving over the speed bump, since both cause the estimated friction to change. A large number of alerts were generated close to a specific speed bump shown in the upper left corner. The location of this speed bump was included in the data from Trafikverket and could therefore be visualized as a black dot. However, speed bumps that were temporarily or recently placed, such as those shown in the bottom right corner, were not included in the road data and could therefore not be used for further analysis and processing. Figure 4.6: Number of friction measurements against the daily air temperature and daily precipitation levels. As can be seen in Figure 4.6, both the daily air temperature and the daily precipitation from SMHI correlate with the number of low estimated friction coefficients sent to the cloud each day. The number of low friction values increases drastically only when the daily air temperature is below 0 ◦C and the daily precipitation is above 0 mm. These days were the 3rd of February and the 7th to the 10th of March. This is reflected in the number of generated alerts previously shown in Figure 4.3.
For other days, even though the daily precipitation was above 0 mm, the number of low friction values sent to the cloud did not increase much since the daily air temperature was above 0 ◦C. Hydroplaning might explain some of the true positives in these cases. Furthermore, some low friction values were sent to the cloud even though the daily air temperature was above 0 ◦C and the daily precipitation was 0 mm. These data points most likely correspond to false positive situations. 5 Methods This chapter describes the methods used in this thesis and how the project was carried out based on the data analysis from the previous chapter. Section 5.1 begins by defining true and false positives in the context of slippery road alerts. In Section 5.2, the proposed solution, a false positive filter, is presented. Section 5.3 describes how the raw vehicle data was processed, including sequence generation, sequence mapping, and sequence grouping. Section 5.4 describes the final data sets used to train, test, and evaluate the machine learning models for each approach. Section 5.5 describes the architectures of the machine learning models and the experiments conducted in this thesis. Finally, Section 5.6 describes the evaluation process of the final models for both the autoencoder and the active learning approach. 5.1 Defining True- and False Positive SRAs The first step of the project was to define true and false positives in the context of slippery road alerts (SRAs). There are many possible causes for SRAs; some of them are ambiguous and therefore more challenging to classify. One can, for example, argue that an SRA generated when a vehicle is driven on a gravel road should be classified as a true positive alert since the situation caused the vehicle to slip.
However, since alerts caused by these types of situations depend on many other parameters and rarely occur in the same location, the alert might not be useful for the driver. Ultimately, defining true and false positive SRAs can be divided into two types of definitions: one where the actual slipperiness is central, and another that focuses more on the value of receiving the alert. For this project, true positive alerts are defined as alerts generated in situations that lead to loss of traction. Conversely, false positive alerts are defined as alerts generated in situations when the vehicle did not slip.

Table 5.1: Definition of true- and false positive slippery road alerts.

  #  True positives      False positives
  1  Ice/snow slip       Driving on speed bump
  2  Hydroplaning        Driving on road crack/pothole
  3  Gravel slip         Making a sharp turn
  4  Mud slip            Rapid braking
  5  Slip on wet leaves  High acceleration

Some of the most common causes of true and false positive alerts under this definition are shown in Table 5.1. In this definition, true positive SRAs depend on road type and environmental factors. Conversely, false positive SRAs are more related to road quality, road design, and vehicle maneuvering. The available data for this thesis sets the limit for which of the true and false positive alerts can be considered in developing the false positive filter. Hence, false positive causes 2-5 could not directly be classified since the available data lacks the features needed to classify them. 5.2 False Positive Filter The main goal of this thesis was to create a false positive (FP) filter capable of reducing the number of false positive alerts generated in an SRA system, as shown in Figure 5.1. Filtering solely on the current weather conditions for each specific situation is not enough for a rigorous solution. For example, some situations when the probability of ice formation on the road surface is high can still be caused by false positive situations, e.g.
driving on a speed bump or a pothole. The idea was to create sequences from the provided vehicle data, train ML models on that data, and finally remove sequences corresponding to false positive situations, e.g. driving on a speed bump or making a sharp turn. The sequential data corresponding to true positive situations is then passed to an SRA system where it is processed to create SRAs that are sent to other nearby vehicles. There are two main reasons why we focused on creating a false positive filter. First, implementing an entire SRA system was impractical due to the limited time of the project. A false positive filter that acts alongside a current SRA system is less time-consuming to implement. Second, the bursts of data points are further processed and filtered in current SRA systems. Hence, the number of valuable features for training the machine learning models is reduced the further into the system the data travels. Therefore, the filter was placed as close to the raw vehicle data as possible, i.e. at the point where the data is sent to the cloud. Figure 5.1: High-level overview of the proposed false positive filter. 5.3 Data Preprocessing The process of generating a data set from the raw vehicle data is divided into the following three main parts: sequence generation, sequence mapping, and sequence grouping. First, the unprocessed vehicle data was clustered based on both time and location to form sequences for unique vehicles. Second, each sequence was mapped to weather stations, speed bump locations, and road types. Finally, the sequential vehicle data was grouped to create the final data sets used to train, test, and evaluate the machine learning models. 5.3.1 Sequence Generation The available vehicle data consisted of unlabeled spatio-temporal data points including the estimated friction coefficient, friction quality, temperature, and wiper speed.
Based on the initial analysis of this data, it seemed difficult to distinguish false positives from true positives based solely on point measurements with the available features. A single point measurement consisting of the friction coefficient and friction quality is not informative enough to determine whether the point corresponds to a true or false positive situation. Many other variables would be needed to make accurate predictions about the situation, including the vehicle speed and the forces acting on the wheels. Additionally, the available data was unlabeled, which increases the complexity of the task of detecting and reducing false positive alerts. The data analysis revealed that vehicles generate sequences of data points. A hypothesis was that both the estimated friction coefficient and the friction quality change over time differently depending on whether the situation corresponds to a true positive situation, e.g. driving on ice, or a false positive situation, e.g. driving on a speed bump. Furthermore, creating sequences from the vehicle data introduces dependencies between consecutive data points. This makes sequences more informative than single point measurements. The idea was therefore to train machine learning models to learn patterns in the sequential vehicle data such that sequences corresponding to false positive situations could be removed. The process of creating these sequences was based on both temporal and spatial clustering. The temporal clustering method is shown in Figure 5.2. First, the data points were grouped by the vehicle identification number (VIN) and sorted by creation time. The creation time represents when each specific data point was created in the cloud and was the only accessible time variable in the vehicle data. The main goal of the temporal clustering was to form rough initial clusters that would reduce the execution time of the spatial clustering later on. Figure 5.2: Temporal clustering of the raw vehicle data.
For a specific vehicle, two consecutive points p and q are considered to belong to the same sequence if the difference in creation time between the points, tp and tq, is within n seconds. Otherwise, the points are placed into separate sequences. Too small a value for n would split sequences belonging to the same situation. Conversely, too large a value for n could result in spatial collisions of sequences for a specific vehicle. For example, a vehicle might generate a sequence of points at time t0 at location l and generate another sequence at a later time t1 at the same location l. The spatial clustering later on would then merge these sequences into one single sequence, which should be avoided since they do not correspond to the same situation. For this initial clustering, n was set to 10 seconds to ensure that each point for a specific situation is captured while also minimizing the risk of spatial collisions of sequences. After adding the next consecutive point q to the current sequence, or creating a new sequence starting with q, the next consecutive pair of data points consists of the point q and the next point in the list of points sorted by creation time. The loop continues until each data point for each vehicle belongs to a sequence. The sequences generated from the initial temporal clustering were further clustered using the DBSCAN clustering algorithm with the Haversine distance as the distance measure, which is covered in Section 3.4. The value of epsilon in DBSCAN should be as small as possible such that points related to the same situation are clustered together. However, too small an epsilon would split sequences that relate to the same situation into different clusters. Conversely, too large a value for epsilon would have the opposite effect, i.e. larger clusters containing sequences from multiple situations.
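The gap-based temporal clustering described above can be sketched as follows. The tuple layout and field names are illustrative assumptions, not the thesis's actual schema; `gap_seconds` corresponds to n.

```python
from datetime import datetime, timedelta
from itertools import groupby

def temporal_sequences(points, gap_seconds=10):
    """Split each vehicle's time-sorted points into sequences whenever
    the creation-time gap between consecutive points exceeds the limit.
    `points` is a list of (vin, creation_time) tuples (illustrative)."""
    sequences = []
    points = sorted(points, key=lambda p: (p[0], p[1]))
    for vin, group in groupby(points, key=lambda p: p[0]):
        current = []
        prev_t = None
        for _, t in group:
            if prev_t is not None and (t - prev_t) > timedelta(seconds=gap_seconds):
                sequences.append(current)  # gap too large: close the sequence
                current = []
            current.append((vin, t))
            prev_t = t
        sequences.append(current)  # close the last sequence for this vehicle
    return sequences
```

Each returned sequence would then be passed on to the spatial DBSCAN step.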
Creating sequences from the vehicle data proved to be a challenging task since many unknown variables such as vehicle speed, road surface, and road construction affect the properties of a sequence, including its length, sparseness, and form (e.g. straight or curved). During the analysis of the vehicle data, it was discovered that data points on highways were more sparse compared to data points on other roads. Using a fixed small value for epsilon in DBSCAN would lead to multiple clusters for data points that relate to the same situation on highways. Increasing epsilon would solve this issue on highways; however, it would introduce issues on other roads where points were more densely packed. Instead, the value of epsilon was computed based on the road maximum speed (RMS) in km/h for each data point, as follows: ϵ = (RMS / 3.6) · k (5.1) The road maximum speed was divided by 3.6 to convert from km/h to m/s and multiplied with a constant k that was tuned visually and finally set to 1. For example, if the vehicle was driven on a road with a maximum road speed of 80 km/h, the value of epsilon would be set to 22 meters. Hence, for two points to be considered neighbors in DBSCAN, the distance between the points would need to be less than or equal to 22 meters when driving on a road with a maximum legal speed of 80 km/h. Ultimately, after running DBSCAN on the entire vehicle data, a total of 6 million sequences were generated. 5.3.2 Sequence Mapping After the sequence generation process, each sequence was mapped to weather stations, speed bump locations, and road types. The mapping was implemented using spatial hashing (described in Section 3.6), which significantly reduced the number of comparisons needed: the gathered vehicle data contained approximately 115 million data points, which would result in billions of comparisons using a naive approach. Figure 5.3: Sequence to speed bump matching.
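Looking back at Equation (5.1) and the Haversine distance used as the DBSCAN distance measure, both can be sketched directly. The Haversine formula below is the standard one; the mean Earth radius value is an assumption of this sketch, not taken from the thesis.

```python
from math import asin, cos, radians, sin, sqrt

def epsilon_metres(road_max_speed_kmh, k=1.0):
    """Eq. (5.1): eps = (RMS / 3.6) * k, converting km/h to m/s."""
    return (road_max_speed_kmh / 3.6) * k

def haversine_metres(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in metres."""
    r = 6_371_000.0  # mean Earth radius in metres (assumed value)
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(a))
```

With this, `epsilon_metres(80)` gives roughly 22 metres, matching the 80 km/h example above, and `haversine_metres` can serve as the pairwise distance DBSCAN compares against epsilon.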
Weather station mapping
Each sequence was mapped to the closest weather station such that the weather data would more accurately represent the current weather conditions for each specific sequence. However, it was impractical and unnecessary to compare the distance between every point within a sequence and the locations of the weather stations. Instead, a single point from the sequence was sampled to represent the location of the entire sequence. The distance between this point and the location of each weather station was computed, and the weather station closest to the point was mapped to the sequence to which the point belonged. This procedure was applied to each sequence in the sequential vehicle data. Finally, since the weather data had a time interval of 30 minutes, the current temperature and precipitation levels were mapped to each sequence using the closest half-hour.

Speed bump location mapping
The mapping of speed bump locations was implemented in two steps. First, a sequence was considered to be close to a speed bump if its points were within a radius of 15 meters of the speed bump, as shown in Figure 5.3. Points outside this radius were removed from the sequence since they were assumed not to relate to the same situation. However, this alone could lead to wrong mappings, e.g. sequences being mapped to speed bumps on a different road. Second, to counteract this issue, at least one point of the sequence needed to be within a radius of 4 meters of the speed bump.

Road type mapping
The mapping of sequences to road types was more complicated than the other mappings since the road type data contained polygons, in contrast to the singular points of the weather stations and speed bump locations. The road type data contained data for asphalt and gravel roads. Hence, it was only necessary to map points within each sequence to gravel roads, since the remaining sequences could be assumed to belong to asphalt roads.
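The weather station mapping described above can be sketched as a nearest-neighbour lookup over a sequence's representative point, together with rounding a timestamp to the closest half-hour to join with the 30-minute weather data. Names and data layout here are illustrative assumptions; the distance function would in practice be a Haversine implementation.

```python
from datetime import datetime, timedelta

def nearest_station(rep_point, stations, distance_fn):
    """Map a sequence's representative point to the closest weather
    station. `stations` maps station id -> (lat, lon); `distance_fn`
    is any point-to-point distance, e.g. Haversine."""
    return min(stations, key=lambda sid: distance_fn(rep_point, stations[sid]))

def nearest_half_hour(t):
    """Round a timestamp to the closest half hour so it can be joined
    with weather data reported at a 30-minute interval."""
    half = timedelta(minutes=30)
    n = round((t - datetime.min) / half)
    return datetime.min + n * half
```

A timestamp of 12:14 would thus be joined with the 12:00 weather row and 12:16 with the 12:30 row.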
To map the sequences to gravel roads, the polygons were first converted to sets of coordinates. Points within a sequence were then mapped to a gravel road if any point was within a radius of 30 meters of a gravel road point. 5.3.3 Sequence Grouping The final data sets used to train, test, and evaluate the machine learning models were created by grouping the sequential vehicle data based on the current weather conditions, speed bump locations, and road types. The sequential vehicle data was grouped based on the current weather conditions from Trafikverket using the flowchart shown in Figure 5.4. The flowchart was created based on assumptions under w