Remaining useful life classification of ECUs in trucks using a transformer encoder model

Master's thesis in Computer Science and Engineering

FREDRIK NYSTRÖM
AXEL SIWMARK

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2025

Master's Thesis 2025

© Fredrik Nyström, Axel Siwmark 2025.

Supervisor: Oana Geman, Department of Computer Science and Engineering
Advisors: Gilberto Hishida, Caio Alves and Shima Saadatimolaee, Volvo Group
Examiner: Robin Adams, Department of Computer Science and Engineering

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: An electronic control unit component for a Volvo combustion engine.

Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

Originally developed for natural language processing, transformer models have achieved state-of-the-art results in tasks such as machine translation and text classification. This has led to increasing interest in applying the transformer architecture to sequential data across multiple other domains.

This thesis takes a binary classification approach to investigate whether a transformer encoder model can be used to classify the remaining useful life of electronic control units (ECUs) in Volvo trucks. The model is trained on operational data and faults related to the ECU to predict whether an ECU is likely to fail within the following three years. The performance of the transformer model is evaluated against traditional machine learning classifiers, including logistic regression, LGBM, Extra Trees, and Random Forest. In addition to standard metrics, a custom cost metric is introduced to reflect the real-world impact of false positives and false negatives.

Results show that the transformer encoder outperforms the traditional models across all evaluation metrics, particularly when used with ensemble methods. However, the transformer encoder still underperformed compared to a naive classifier on the custom cost metric. This work serves as a starting point for improving the decision-making process in ECU refurbishment.

Keywords: electronic control unit, machine learning, remaining useful life, transformer, Volvo.

Acknowledgements

We are grateful to our supervisor, Oana Geman, for her guidance throughout the thesis. We would also like to thank our supervisors at Volvo, Gilberto Hishida, Caio Alves, and Shima Saadatimolaee, for their valuable support. Additionally, we extend our thanks to everyone at E&E for their friendliness and the enjoyable fikas. Finally, we acknowledge ChatGPT and Grammarly, which were used for proofreading and grammar correction during the writing process.

Fredrik Nyström & Axel Siwmark

Contents

List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Related Work
  1.2 Purpose
  1.3 Scope
  1.4 Outline of Report

2 Theory
  2.1 Baseline Models
    2.1.1 Logistic Regression
    2.1.2 Random Forest Classifier
    2.1.3 Extremely Randomized Trees Classifier
    2.1.4 Light Gradient Boosting Machine Classifier
  2.2 Transformer Model
    2.2.1 Input Embedding and Positional Encoding
    2.2.2 Encoder Architecture
    2.2.3 Self-Attention and Multi-Head Attention
    2.2.4 Feed-Forward Network
    2.2.5 Final Layer for Classification
  2.3 Regularization Techniques in Neural Networks
    2.3.1 Dropout
    2.3.2 Weight Decay
    2.3.3 Early Stopping
    2.3.4 Ensemble Learning
  2.4 Metrics for Model Evaluation
    2.4.1 Accuracy and Balanced Accuracy
    2.4.2 Weighted F1 Score
  2.5 Multiple Imputation by Chained Equations

3 Methods
  3.1 Tools
  3.2 Data Exploration and Compilation
  3.3 Data Preprocessing Steps
    3.3.1 Data Cleaning and Aggregation
    3.3.2 Overview of Data Classes
    3.3.3 Dataset Overview
    3.3.4 Handling Missing Values with MICE
    3.3.5 Rescaling Data with Robust Scaler
  3.4 Selecting Baseline Models Using Lazy Predict
  3.5 Architecture and Hyperparameter Search for Transformer Model
  3.6 Model Evaluation

4 Results
  4.1 Results for Baseline Models
  4.2 Results for Transformer Models

5 Discussion
  5.1 Comparison of Model Performance
  5.2 Future Work
  5.3 Conclusion

Bibliography

List of Figures

2.1 Plot of the sigmoid function f(z) = 1/(1 + e^{-z}).
2.2 Random forest architecture with three decision trees, where each tree's result is combined with majority voting resulting in the final class prediction.
2.3 Extra trees architecture with three decision trees using the entire dataset, and predictions aggregated through majority voting.
2.4 Illustration of the LGBM training process for 3 iterations. Each tree is trained sequentially on residuals from the previous tree, and the trees grow leaf-wise as indicated by the blue leaves.
2.5 Structure of a general transformer encoder architecture.
2.6 Structure of a general encoder layer.
2.7 Dropout example. Left: a neural network with two hidden layers. Right: the same neural network with dropout applied, where the neurons marked with crosses have been dropped.
3.1 An example of the monthly aggregation method with mean replacement.
3.2 Class distribution of the tabular dataset between the classes: healthy and failed vehicles.
3.3 Sequence distribution between the two classes: healthy and failed vehicles.
3.4 The length of the sequences and their frequency for the two classes: healthy and failed vehicles.
3.5 Final transformer encoder architecture based on parameters from the exhaustive search.
4.1 Correct and incorrect predictions made by the ensemble model from the random search, grouped by the number of agreeing models (three, four, or five).
4.2 Correct and incorrect predictions made by the ensemble model from the exhaustive search, grouped by the number of agreeing models (three, four, or five).

List of Tables

2.1 Definition of True Positives, False Positives, True Negatives, and False Negatives.
3.1 Python libraries used in the thesis with their version and use case.
3.2 The eleven original operational features and the number of columns per feature.
3.3 The eight final operational features and the number of columns per feature after merging intervals.
3.4 An example sequence for one vehicle, which gets label 2 since the final reading has class 2.
3.5 Performance of the top 10 traditional classifiers based on accuracy using Lazy Predict.
3.6 Hyperparameter search space for random search with a total of 6912 combinations.
3.7 Best random search hyperparameter configuration after 1000 iterations.
3.8 Hyperparameter search space for local exhaustive search with a total of 243 combinations.
3.9 Best local exhaustive search hyperparameter configuration.
3.10 Outline of confusion matrix for binary classification.
3.11 Estimated cost ratio between FP and FN for the RUL of ECUs. The cost for TP and TN is zero.
4.1 Results of baseline models, including naive classifiers which predict either only healthy or only failed.
4.2 Confusion matrix for Logistic Regression.
4.3 Confusion matrix for the Random Forest model.
4.4 Confusion matrix for the Extra Trees model.
4.5 Confusion matrix for the LGBM model.
4.6 Results of transformer encoder models and naive classifiers on the test set.
4.7 Average confusion matrix over 5 runs for the non-ensemble Transformer model with random search parameters.
4.8 Average confusion matrix over 5 runs for the non-ensemble Transformer model with exhaustive search parameters.
4.9 Confusion matrix for the ensemble Transformer model with random search parameters.
4.10 Confusion matrix for the ensemble Transformer model with exhaustive search parameters.

List of Abbreviations

C-MAPSS - Commercial Modular Aero-Propulsion System Simulation
CNN - Convolutional Neural Network
ECU - Electronic Control Unit
GRU - Gated Recurrent Unit
LGBM - Light Gradient Boosting Machine
LSTM - Long Short-Term Memory
MLP - Multi-Layer Perceptron
MICE - Multiple Imputation by Chained Equations
RNN - Recurrent Neural Network
RUL - Remaining Useful Life
RVR - Relevance Vector Regression
SVR - Support Vector Regression

1 Introduction

Remaining Useful Life (RUL) is a key metric in data-driven methods used to evaluate the health of components. It is applied in various domains, including battery health monitoring, fleet vehicle maintenance, and aircraft engine diagnostics [1][2][3][4][5]. In recent years, interest in data-driven approaches for RUL prediction has been steadily increasing. This trend is largely driven by advancements in sensor and measurement technologies along with wireless data collection [6].

The automotive industry generates large amounts of data due to its connected vehicles. Volvo, a Swedish truck manufacturer with substantial data resources, is exploring data-driven solutions for RUL prediction of truck components. Modern Volvo trucks consist of many interconnected components. One of the most critical is the engine Electronic Control Unit (ECU), also known as the Engine Control Module (ECM), which will be referred to as the ECU for the remainder of this thesis. The ECU controls various functions, including the electronic input and output of the engine.

When an unknown error occurs in a truck, the ECU is removed for assessment. This assessment involves performing multiple tests on the ECU, such as sending current through it. If the ECU passes these tests it may be refurbished and reinstalled in a truck; otherwise it is discarded. Volvo is interested not only in determining whether a refurbished ECU is reusable but also in predicting its RUL. Accurate RUL prediction helps determine whether reinstalling the ECU is a reasonable choice, leading to cost savings through reduced maintenance and fewer breakdowns, which in turn means less downtime.

1.1 Related Work

Numerous studies have been conducted on predicting the RUL of components and systems, typically categorized into physics-based, experimental, and data-driven methods.
While physics-based methods can perform well, they are often difficult to develop for complex environments. Experimental methods are costly and may have scaling issues, limiting their real-world relevance. Therefore, this work focuses on data-driven methods, which are flexible and can use the available information more completely [7].

Data-driven methods for RUL prediction have mainly been oriented around deep learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), since they work well on time series data. Multiple papers and models for RUL prediction have been evaluated on the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) datasets, which are simulated datasets created by NASA [8]. One example is the paper by Sateesh Babu et al. [2], which showed that a novel CNN could outperform other models such as the Multi-Layer Perceptron (MLP), Support Vector Regression (SVR), and Relevance Vector Regression (RVR). In a study by Li et al. [3], a CNN with four convolutional layers and dropout was also evaluated on the C-MAPSS datasets and outperformed several other models, including Long Short-Term Memory (LSTM) networks and Random Forest.

Multiple studies have explored the application of RNNs and their variants, such as LSTM and Gated Recurrent Unit (GRU) networks, for RUL prediction. In the paper by Cheng et al. [9], an ensemble learning approach of LSTMs was evaluated on the C-MAPSS datasets. Their ensemble learning LSTM model outperformed not only simpler models, such as SVR and RVR, but also more advanced architectures, including deep CNNs.

Models like RNNs and LSTM networks have commonly been used for time series tasks such as RUL prediction. In 2017, a new architecture called the Transformer was introduced by Vaswani et al. [10]. Originally developed for natural language processing, the Transformer has achieved great success in that domain. This has led to growing interest in applying transformers to other areas, including regression problems like RUL prediction. In the paper by Chen et al. [4], a transformer model was used to predict the RUL of lithium-ion batteries. The transformer outperformed several other state-of-the-art models such as LSTM and GRU. A recent study by Ogunfowora and Najjaran [5] shows that an encoder-transformer model with an expanding window method for data preparation also outperforms state-of-the-art models on the C-MAPSS datasets. In the paper by Zerveas et al. [11], a transformer encoder architecture was applied to several multivariate time series datasets from different domains. They demonstrated that it often outperformed other methods for both regression and classification tasks, even with datasets containing only a few hundred training samples. For classification tasks, the model showed an average accuracy improvement of 2-25 percentage points over other models across eleven datasets.

In 2024, Scania, a Swedish truck company, released the SCANIA Component X Dataset [1], a real-world multivariate time series dataset for predictive maintenance and RUL prediction. It was released to establish a benchmark for predictive maintenance and support reproducible research. The dataset used in this work was developed with inspiration from the SCANIA Component X Dataset.

Recent studies have shown that the encoder-transformer performs well on RUL prediction tasks [4][5][11], making it the main focus of this work.
This study takes a classification approach to predicting the RUL of ECUs using time-sequential vehicle data from Volvo. To evaluate the performance of the encoder-transformer model, traditional machine learning methods such as Logistic Regression and tree-based ensemble models will also be used for comparison.

1.2 Purpose

The purpose of this thesis is to develop and evaluate a transformer-based machine learning model for predicting the RUL of ECUs. Along the way to this objective, three subgoals will be addressed.

• SG1: Evaluate and compare the transformer encoder model against traditional machine learning models.
• SG2: Optimize model performance by fine-tuning the hyperparameters of a transformer encoder model.
• SG3: Explore whether ensemble methods can improve the transformer encoder model's performance.

1.3 Scope

In order to make the data comparable, only data from the European region were considered. This also ensured similar regulatory standards for the trucks. To further narrow the scope, the data used were from heavy-duty trucks with the same ECU family. Additionally, due to time constraints, the number of models explored was limited to primarily focus on the transformer encoder model. To provide a benchmark, four traditional machine learning models were included for baseline comparison.

1.4 Outline of Report

The structure of the remainder of this thesis is organized as follows: Chapter 2 presents the relevant theory. Chapter 3 describes the approach and methodology. The results are presented in Chapter 4. Finally, Chapter 5 provides a discussion and conclusion of the results, along with suggestions for future work.

2 Theory

This chapter provides information about the theoretical foundations and models used throughout this thesis. It begins by describing several baseline classifiers, including Logistic Regression, Random Forest, Extra Trees, and Light Gradient Boosting Machine. After that, the Transformer architecture is introduced. This is followed by an overview of regularization techniques commonly used in neural networks to prevent overfitting and improve generalization. Additionally, a description of the Multiple Imputation by Chained Equations method used for handling missing data is presented. Finally, the evaluation metrics used to assess model performance are defined.

2.1 Baseline Models

The baseline models in this thesis are standard machine learning classifiers that serve as reference points for evaluating the performance of the more advanced transformer model. All the baseline models except logistic regression use decision trees. A decision tree is a structure where, at each node, a decision is made based on the value of a specific feature. This decision determines which branch to follow next as the input vector moves down the tree. A class is assigned to the input vector when it reaches a leaf node. All the tree-based models are also ensemble models, meaning that they build and combine multiple trees.

2.1.1 Logistic Regression

When the dependent variable is binary, as in this thesis, binary logistic regression is employed. The primary objective in binary logistic regression is to estimate the probability that the dependent variable Y equals 1, given a set of independent input features. The input features form a linear combination (see Equation 2.1), where each feature x_i has a coefficient β_i that controls its impact. Additionally, a bias term β_0 is added.
This linear combination is then fed into the sigmoid function (see Equation 2.2), which transforms it into a probability between 0 and 1, see Figure 2.1.

z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n    (2.1)

f(z) = \frac{1}{1 + e^{-z}}    (2.2)

Figure 2.1: Plot of the sigmoid function f(z) = \frac{1}{1 + e^{-z}}.

The parameters β_0, β_1, ..., β_n are learned from the training data. They are typically initialized randomly or with small values and then updated through an optimization process that seeks to maximize the likelihood of the observed outcomes. Finally, applying the sigmoid function to z yields the estimated probability that the dependent variable Y equals 1, as shown in Equation 2.3. This probability then determines the predicted class.

P(Y = 1 \mid x_1, x_2, \ldots, x_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}    (2.3)

2.1.2 Random Forest Classifier

Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to make a final prediction [12]. It introduces two sources of randomness to increase the diversity of the trees. The first is bootstrap sampling, which means that each tree is trained on a randomly selected subset of the entire dataset, as shown in Figure 2.2. The second is that a random subset of features, rather than all the features, is considered when splitting nodes during tree construction.

The split is determined by selecting the feature and threshold that best separate the data, minimizing impurity within the resulting child nodes. Impurity is a measure of how mixed the class labels are within a node. A node is pure if all samples in it belong to a single class and considered impure if it contains samples of multiple classes. A common impurity measure for classification is Gini impurity [13]. Given C distinct classes, and p(i) representing the probability of selecting a data point at a child node belonging to class i, the Gini impurity is defined as shown in Equation 2.4.

G = \sum_{i=1}^{C} p(i) \cdot (1 - p(i))    (2.4)

By minimizing impurity at each split, the decision tree becomes more effective at separating the data into homogeneous groups. Splitting typically continues until a predefined maximum depth of the tree is reached or other stopping criteria are met, such as a minimum number of samples per node.

Figure 2.2: Random forest architecture with three decision trees, where each tree's result is combined with majority voting resulting in the final class prediction.

2.1.3 Extremely Randomized Trees Classifier

Extremely Randomized Trees (Extra Trees), shown in Figure 2.3, is also an ensemble learning method based on decision trees [14]. Similarly to Random Forest, it builds multiple trees and aggregates their predictions, for example through majority voting. Extra Trees introduces additional randomness into the tree-building process and differs from traditional decision tree ensembles in two key ways. Firstly, unlike Random Forest, which uses random subsets of the data to grow each tree (as shown in Figure 2.2), this method uses the entire original dataset for each tree. Secondly, during splitting, Extra Trees selects the splitting threshold at random, whereas Random Forest determines the optimal threshold to minimize impurity. This randomness increases diversity among the trees and can lead to better performance.
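The contrast between the two methods can be made concrete with scikit-learn, which implements both. The following is a minimal sketch on synthetic data; the dataset and parameter values are illustrative, not the configuration used in this thesis.

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: bootstrap sample per tree, optimal threshold per split.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
# Extra Trees: whole dataset per tree (no bootstrap), random split thresholds.
et = ExtraTreesClassifier(n_estimators=100, bootstrap=False, random_state=0)

for name, model in [("Random Forest", rf), ("Extra Trees", et)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))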
Figure 2.3: Extra trees architecture with three decision trees using the entire dataset, and predictions aggregated through majority voting.

2.1.4 Light Gradient Boosting Machine Classifier

Light Gradient Boosting Machine (LGBM) is a Gradient Boosting Decision Tree model developed by Microsoft [15]. A Gradient Boosting Decision Tree is an ensemble model that builds trees sequentially. For LGBM, the trees are grown leaf-wise, as shown in Figure 2.4. Each new tree attempts to predict and correct the errors (residuals) of its predecessor. The tree is then added to the ensemble with the previous trees to improve overall performance. This process is repeated iteratively until the model reaches a predefined number of trees.

When the final prediction is made, each tree contributes a fixed score based on which leaf node a sample falls into. These scores are scaled by the learning rate and summed together to produce the final score. For binary classification, the final score is passed through the sigmoid function to generate a probability, which then determines the predicted class. Each leaf score represents the optimal constant value that minimizes the loss function for the data samples assigned to that leaf. In other words, if a tree were used on its own, the leaf score would be the best possible prediction for all samples that reach that leaf, based on the residual errors at that point in training. The boosting process then adds up these contributions from many such trees to build a strong overall model.

LGBM introduces two key techniques, Gradient-Based One-Side Sampling and Exclusive Feature Bundling, to achieve faster training and better memory efficiency compared to traditional gradient boosting methods. Gradient-Based One-Side Sampling speeds up training by selecting a subset of the data based on gradient values. It keeps all data instances with large gradients, which are those the model is currently performing poorly on, and randomly samples from instances with small gradients. This lets the model focus on the most informative data points while reducing computational cost.

Exclusive Feature Bundling reduces the number of features by combining mutually exclusive features, which never take non-zero values at the same time, into a single feature. This is possible since these features do not overlap in terms of their non-zero values, so only information about which original feature corresponds to a zero value is lost. This reduces the feature dimensionality without significant information loss, which leads to faster training and reduced memory usage.

Figure 2.4: Illustration of the LGBM training process for 3 iterations. Each tree is trained sequentially on residuals from the previous tree, and the trees grow leaf-wise as indicated by the blue leaves.
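The scikit-learn style interface of the LightGBM library exposes these ideas through a few key parameters. The sketch below is illustrative only; the parameter values shown are the library defaults rather than settings from this thesis.

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Synthetic data for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# num_leaves caps the leaf-wise growth of each tree; each boosting round fits
# a new tree to the residuals of the current ensemble, and its leaf scores
# are scaled by learning_rate before being added.
clf = LGBMClassifier(n_estimators=100, num_leaves=31, learning_rate=0.1)
clf.fit(X, y)

# For binary classification, the summed leaf scores are passed through the
# sigmoid to obtain the probability of class 1.
probabilities = clf.predict_proba(X)[:, 1]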
2.2 Transformer Model

The Transformer was originally introduced by Vaswani et al. [10]. It is an encoder-decoder architecture based on self-attention that takes a sequence as input and outputs a sequence. The encoder turns the input sequence into a high-dimensional representation that the decoder uses to produce an output sequence. This architecture produced state-of-the-art results in natural language processing tasks such as machine translation and general language understanding.

Unlike recurrent neural networks and convolutional neural networks, transformers do not rely on recurrence or convolution to capture sequential or spatial dependencies in data. Instead, transformers use self-attention to capture relationships between feature vectors at each timestep in the sequence. This means that each feature vector attends to all other feature vectors in the sequence simultaneously. As a result, transformers handle long-range dependencies more effectively and allow for parallelization that speeds up training.

To perform classification and produce a single prediction rather than a sequence, only the encoder of the Transformer is used, similar to the approach of Zerveas et al. [11]. This encoder-based setup consists of three main components: input embeddings combined with positional embeddings, the encoder layers, and the output layer, see Figure 2.5.

Figure 2.5: Structure of a general transformer encoder architecture.

2.2.1 Input Embedding and Positional Encoding

The input to the encoder is first passed through an embedding layer, which projects the input to a higher-dimensional space. This gives the model more room to express relationships between features. Since the Transformer does not process the inputs sequentially, positional embeddings are added to the input embeddings to preserve the order of elements in the input sequence, see Figure 2.5. One approach to selecting positional embeddings is to use sinusoidal positional encodings [10], while another is to use fully learnable positional encodings [11].

2.2.2 Encoder Architecture

The encoder consists of N stacked, identical encoder layers. Each layer (see Figure 2.6) contains two main components: a multi-head self-attention mechanism (explained in Section 2.2.3) and a position-wise feed-forward network described in Section 2.2.4. Each of these components is surrounded by a residual connection, followed by layer normalization, as represented in the formula:

\text{LayerNorm}(x + \text{Sublayer}(x))

Here, x is the input to the sublayer (either the attention mechanism or the feed-forward network), and Sublayer(x) is the output produced by applying that sublayer to the input. Residual connections help gradients flow through deep networks [16], and layer normalization improves convergence. To facilitate the residual connections, all layers in the model and the embeddings have the same dimension [10].

Figure 2.6: Structure of a general encoder layer.

2.2.3 Self-Attention and Multi-Head Attention

The self-attention mechanism is a fundamental building block of the Transformer architecture. It enables the model to compute relationships and similarities between embedding vectors in the input sequence. This is accomplished by transforming the input sequence into three matrices: queries (Q), keys (K), and values (V):

Q = XW^Q, \quad K = XW^K, \quad V = XW^V

The input sequence X \in \mathbb{R}^{n \times d_{model}} consists of n embedding vectors with an embedding size d_{model}. The weight matrices W^Q, W^K, W^V \in \mathbb{R}^{d_{model} \times d_k} are learnable parameters that are initialized with the Xavier (Glorot) initialization [17] and used to project the input into query, key, and value spaces. For single-head attention, each of the matrices Q, K, and V has dimensions \mathbb{R}^{n \times d_k}, where d_k = d_{model}/h = d_{model}/1 = d_{model}. The attention output is then computed using the scaled dot-product attention mechanism introduced by Vaswani et al. [10], as shown in Equation 2.5.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V    (2.5)

This operation produces a new representation of each embedding vector as a weighted sum of all embedding vectors in the sequence, with learned attention weights.
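Equation 2.5 can be written out directly in PyTorch. The sketch below uses a single head with d_k = d_model, matching the discussion above; the shapes and values are illustrative, and the padding mask used later in this thesis (Section 3.3.3) is omitted.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d_k), as in Equation 2.5; no padding mask is applied here.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) similarity scores
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # weighted sum of the values

# Single head: d_k = d_model, so the projections keep the embedding size.
n, d_model = 36, 256
X = torch.randn(n, d_model)
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)
out = scaled_dot_product_attention(W_q(X), W_k(X), W_v(X))  # (n, d_model)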
In multi-head attention, the attention function is applied independently and in parallel across h separate heads, each creating and using its own linear projections of the queries, keys, and values. Each attention head (see Equation 2.6) focuses on different aspects of the input, allowing the model to capture a wider variety of dependencies across the sequence. These independent attention outputs are then concatenated and projected back to the original dimensions of \mathbb{R}^{n \times d_{model}}, see Equation 2.7.

\text{head}_i = \text{Attention}(Q_i, K_i, V_i) \quad \text{for } i = 1, \ldots, h    (2.6)

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O    (2.7)

Here Q_i, K_i, V_i \in \mathbb{R}^{n \times d_k}, where n is the sequence length and d_k = d_{model}/h is the dimensionality of each attention head. The learnable weight matrix W^O \in \mathbb{R}^{hd_k \times d_{model}} projects the concatenated result, which has dimensions \mathbb{R}^{n \times hd_k}, back to the original embedding size d_{model}.

2.2.4 Feed-Forward Network

After the multi-head attention layer, each encoder layer also contains a Feed-Forward Network (FFN). The FFN is applied independently to each attention embedding in the input sequence. It consists of two fully connected linear layers with an activation function (σ) in between, as shown in Equation 2.8. The first layer projects the input from d_{model} to a higher-dimensional space d_{ff}, and the second projects it back to d_{model}. Specifically, W_1 and b_1 are the weights and bias for the first linear layer, while W_2 and b_2 are the weights and bias for the second. The activation function σ introduces non-linearity between the two layers.

\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2    (2.8)

2.2.5 Final Layer for Classification

The final step in the transformer encoder model is to make a prediction based on the output from the final encoder layer. This output, a vector representing the final processed representation of the input sequence, is passed through a final layer with dimensions d_{model} × 1, where d_{model} is the embedding size. This final layer produces a scalar value, which represents the model's raw prediction. For binary classification, the raw prediction is passed through a sigmoid function to convert it into a probability between 0 and 1. A threshold of 0.5 is used to decide the class, with probabilities at or above 0.5 classified as class 1 and probabilities below 0.5 classified as class 0.

2.3 Regularization Techniques in Neural Networks

Overfitting is a common issue in neural networks, where a model performs well on the training data but poorly on unseen data. This happens when the network becomes too specialized to the training set, learning noise and patterns that do not generalize. Overfitting often occurs when the model is too complex or trained for too many epochs. To reduce the risk of overfitting and improve generalization, techniques like dropout, weight decay, early stopping, and ensemble learning are often used.

2.3.1 Dropout

Dropout is a regularization technique used to reduce overfitting in neural networks. It works by randomly removing a subset of neurons during each forward pass in training. This means that selected neurons and their incoming and outgoing connections are temporarily removed from the network during training. The idea is that this forces the network to learn general features from the data, since it cannot rely only on the presence of specific neurons. An example of dropout applied to a neural network with two hidden layers is shown in Figure 2.7.

Figure 2.7: Dropout example. Left: a neural network with two hidden layers. Right: the same neural network with dropout applied, where the neurons marked with crosses have been dropped.

The dropout process is stochastic, meaning that each neuron is set to zero with an independent probability p. For example, a dropout rate of 0.4 means that each neuron has a 40% chance of being dropped in a given training step. This randomness effectively results in the training of multiple smaller subnetworks, each of which sees a slightly different view of the data. At test time, all neurons are active and dropout is not used, but the outputs of neurons are scaled down by the keep probability (1 − p) used during training. This compensates for the increased number of active units during inference and ensures that the expected output remains consistent between training and testing.

Dropout has been shown to improve performance across a wide range of tasks, including image classification, speech recognition, and translation, demonstrating that it is a general-purpose technique effective in many domains [10] [18] [19].
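The behaviour described above can be observed directly in PyTorch. Note that PyTorch implements the equivalent "inverted" formulation: surviving activations are scaled by 1/(1 − p) during training, so no rescaling is needed at test time. The snippet is illustrative.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.4)  # each element is zeroed with probability 0.4
x = torch.ones(8)

drop.train()    # training mode: random zeros, survivors scaled by 1/(1 - 0.4)
print(drop(x))  # e.g. tensor([1.6667, 0.0000, 1.6667, 1.6667, 0.0000, ...])

drop.eval()     # evaluation mode: dropout disabled, input passes through unchanged
print(drop(x))  # tensor([1., 1., 1., 1., 1., 1., 1., 1.])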
2.3.2 Weight Decay

The L2 norm penalty, commonly known as weight decay, is one of the simplest and most widely used regularization techniques in neural networks [20]. It works by penalizing large weights, encouraging the model to keep them as small as possible without forcing them to zero. This is done by adding the penalty term in Equation 2.9 to the objective function, where w represents the model weights and λ is the regularization strength.

R(w) = \frac{\lambda}{2} \|w\|_2^2 = \frac{\lambda}{2} \sum_{i=1}^{n} w_i^2    (2.9)

The regularization strength λ is a hyperparameter, and weight decay has two main benefits. Firstly, it helps reduce overfitting by encouraging smaller weight vectors, which often leads to simpler and more robust models. Secondly, when properly tuned, it can reduce the influence of noise in the training targets and improve the model's ability to generalize to unseen data [21].

2.3.3 Early Stopping

Early stopping is another regularization technique. It aims to train the model long enough to reach good generalization but stop before overfitting begins [22]. Early stopping is also useful for saving computational resources, especially during cross-validation, where the model is trained multiple times with different hyperparameter configurations. With early stopping, suboptimal configurations are terminated quickly, reducing unnecessary computation and improving overall efficiency.

Stopping at the first indication of overfitting usually leads to missing later optima, since metrics such as validation performance can vary due to randomness in optimization, batch sampling, or noise in the data. Therefore, various stopping criteria exist that aim to find a good trade-off between generalization and computational cost. Patience-based early stopping is a simple stopping criterion based on the idea that if the validation error does not improve for a set number of epochs (called the patience), the model is considered to have started overfitting. In this case, training is stopped, and the model weights corresponding to the lowest validation error are returned. This decision is made even if the increase in validation error is small. The steps are shown in Algorithm 1.

Algorithm 1 Early Stopping with Patience
1: Input: P (patience), E (number of training epochs)
2: Output: θ (best model weights)
3: best_error ← ∞
4: epochs_no_improve ← 0
5: θ ← None
6: for e = 1 to E do
7:   if validation_error(e) < best_error then
8:     best_error ← validation_error(e)
9:     epochs_no_improve ← 0
10:    θ ← current model weights
11:  else
12:    epochs_no_improve ← epochs_no_improve + 1
13:  end if
14:  if epochs_no_improve ≥ P then
15:    return θ
16:  end if
17: end for
18: return θ
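Algorithm 1 translates almost line for line into Python. In the sketch below, run_epoch is a hypothetical placeholder for one epoch of training followed by evaluation on the validation set, and the model is assumed to be a PyTorch module so its weights can be captured with state_dict. The defaults of 200 epochs and a patience of 10 match the training setup used later in Section 3.5.

import copy

def train_with_early_stopping(model, run_epoch, max_epochs=200, patience=10):
    """Return the weights with the lowest validation error (Algorithm 1)."""
    best_error = float("inf")
    epochs_no_improve = 0
    best_weights = None
    for epoch in range(max_epochs):
        error = run_epoch(model, epoch)  # hypothetical: train one epoch, return validation error
        if error < best_error:
            best_error = error
            epochs_no_improve = 0
            best_weights = copy.deepcopy(model.state_dict())  # remember the best weights
        else:
            epochs_no_improve += 1
        if epochs_no_improve >= patience:
            break  # patience exhausted: assume overfitting has started
    return best_weights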
2.3.4 Ensemble Learning

Ensemble learning is a technique in machine learning where multiple models, called inducers or base learners, are combined to improve prediction accuracy. Ensemble methods are considered the state-of-the-art solution for many machine learning tasks and can be used with different types of models, such as decision trees, neural networks, or linear regression models [23]. By combining the outputs from several models, the impact of errors from any one model is reduced. There are various ways to combine the outputs of the models, but a simple approach for classification is majority voting. In majority voting, each inducer has equal impact and the final prediction is the class that receives the most votes from the individual models.

Ensemble methods are widely used because they often perform better than individual models, especially when the models are accurate and make different types of mistakes. This diversity is important because similar models that make the same predictions in an ensemble will not improve results much. One way to create diversity is by splitting the training data into different parts and training each model on a separate subset. This idea can be combined with cross-validation, which already splits the dataset in this way. Instead of just using the folds for finding hyperparameters, the model from each fold can be saved and later used as part of an ensemble.

Despite the higher performance of ensemble models, there are a few downsides to consider. First, they can lead to longer prediction times, especially when the ensemble includes a large number of complex models. Second, ensemble models can reduce interpretability, since the final output is based on the combined decisions of several inducers.

2.4 Metrics for Model Evaluation

When evaluating classification models, several metrics are used to assess performance, all of which rely on four key values: True Positives, False Positives, True Negatives, and False Negatives, defined in Table 2.1. Two other important metrics are precision and recall. Precision measures how accurate the model is in its positive predictions, see Equation 2.10. Recall indicates the proportion of actual positive instances correctly identified by the model, see Equation 2.11.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}    (2.10)

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}    (2.11)

Table 2.1: Definition of True Positives, False Positives, True Negatives, and False Negatives.

               | Predicted Class 1  | Predicted Class 0
Actual Class 1 | True Positives     | False Negatives
Actual Class 0 | False Positives    | True Negatives

2.4.1 Accuracy and Balanced Accuracy

Balanced accuracy is a metric used to evaluate the performance of classification models and is particularly useful in cases where class distributions are imbalanced. Unlike standard accuracy (see Equation 2.12), balanced accuracy ensures that each class contributes equally to the overall performance measurement. It is calculated as the average of the recall values for each of the N classes, see Equation 2.13. This provides a better assessment of model performance on imbalanced datasets.

\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{False Positives} + \text{True Negatives} + \text{False Negatives}}    (2.12)

\text{Balanced Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{True Positives}_i}{\text{True Positives}_i + \text{False Negatives}_i}    (2.13)
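Both metrics are available in scikit-learn, where the difference is easy to demonstrate: on an imbalanced toy example, a classifier that always predicts the majority class scores high on accuracy but only 0.5 on balanced accuracy. The 4:1 ratio below mirrors the tabular dataset described later in Section 3.3.3; the example itself is synthetic.

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 80 + [1] * 20)  # 4:1 class imbalance
y_pred = np.zeros(100, dtype=int)       # naive classifier: always the majority class

print(accuracy_score(y_true, y_pred))           # 0.80
print(balanced_accuracy_score(y_true, y_pred))  # (1.0 + 0.0) / 2 = 0.50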
2.4.2 Weighted F1 Score

The weighted F1 score is another metric that accounts for class imbalance. It uses the harmonic mean of precision and recall for each class, just like the standard F1 score, but it also weights each class's score by w_i, the number of samples in class i divided by the total number of samples in the dataset, see Equation 2.14. This metric is useful when dealing with imbalanced datasets, as it reflects overall performance while considering the distribution of classes.

\text{Weighted F1 Score} = \sum_{i=1}^{N} w_i \cdot \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}    (2.14)

2.5 Multiple Imputation by Chained Equations

Multiple Imputation by Chained Equations (MICE) is a method for handling missing data in multivariate datasets [24]. Unlike simpler imputation methods, such as mean or median imputation, MICE accounts for the relationships between variables, improving the quality of the imputations.

The MICE process begins by filling missing values with placeholder values, for example, using mean values. Next, for each variable with missing values, the algorithm resets that variable to missing and uses a model to predict the missing values based on the other variables. This step is repeated for every variable with missing values and constitutes one cycle. The process is described step by step in Algorithm 2.

Algorithm 2 Iterative Imputation Algorithm
1: Input: D (dataset with missing values), K (number of cycles)
2: Output: D̂ (imputed dataset)
3: Let D̂ be a copy of D with missing values replaced by temporary placeholders (e.g., mean/median)
4: for k = 1 to K cycles do
5:   for each variable v_i with missing values do
6:     Set values in v_i that were placeholders back to missing
7:     Fit a model M_i on D̂ with v_i as the target and the rest as predictors
8:     Use M_i to predict and impute missing values in v_i
9:   end for
10: end for
11: return D̂

The algorithm operates iteratively and refines the imputation with each cycle. Usually between 5 and 50 cycles are run to ensure that the computation converges to stable and accurate values. The iterative process allows the imputation to handle more complex dependencies between the variables.

3 Methods

This chapter outlines the tools and methodology used in this thesis. The approach includes data compilation, preprocessing, baseline model selection, hyperparameter search for the transformer encoder model, and model evaluation. In this chapter and the rest of the thesis, vehicles that have experienced an ECU failure are referred to as failed vehicles, while those that have not are referred to as healthy vehicles.

3.1 Tools

To view, find, and analyze the different features of the dataset, Azure Data Studio was utilized. The software built in this thesis was developed using Python version 3.13.1. Python is widely used for data science because of its many useful libraries. The libraries used and their use cases are listed in Table 3.1.

Table 3.1: Python libraries used in the thesis with their version and use case.

Library | Version | Use Case
Pyodbc [25] | 4.0.30 | Connect to the database and send SQL queries.
SciPy [26] | 1.7.3 | Data exploration.
NumPy [27] | 1.21.0 | Compiling, cleaning, and manipulating data.
Pandas [28] | 1.3.0 | Handling structured data.
Matplotlib [29] | 3.4.2 | Visualizing and exploring the data using plots.
Seaborn [30] | 0.11.1 | Visualizing and exploring the data using plots.
Scikit-learn [31] | 0.24.2 | Machine learning models and data preprocessing.
LightGBM [15] | 4.6.2 | Machine learning model.
PyTorch [32] | 1.9.0 | Artificial neural networks and Transformer model.
Lazy Predict [33] | 0.2.1 | Quickly testing multiple machine learning models.
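As a brief illustration of how the first libraries in Table 3.1 work together, Pyodbc opens the database connection and pandas turns a SQL query into a DataFrame. The connection string, table, and column names below are hypothetical placeholders, not the actual Volvo database schema.

import pandas as pd
import pyodbc

# Hypothetical connection string; the real server and database differ.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>;DATABASE=<database>;Trusted_Connection=yes"
)

# Hypothetical table and column names, for illustration only.
query = "SELECT vehicle_id, reading_date, total_distance FROM operational_data"
df = pd.read_sql(query, conn)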
3.2 Data Exploration and Compilation

Volvo provided a database containing data from all their trucks, which we explored using Azure Data Studio. We focused on two main types of data. The first type, inspired by the SCANIA Component X Dataset [1], was operational data collected from truck sensors. This data measures the vehicle's performance, for example, total engine running time or total distance driven with different loads. The features in this dataset were either scalar or in histogram format, meaning some features had multiple columns representing different intervals. As an example, the feature for distance driven with different load levels has intervals such as 0-2 tons and 2-4 tons. The original operational features are listed in Table 3.2.

Table 3.2: The eleven original operational features and the number of columns per feature.

Row | Operational feature | #Columns
1 | Total distance driven | 1
2 | Total engine running time | 1
3 | Total time in power take-off mode | 1
4 | Total time running in top gear | 1
5 | Total engine fuel consumption | 3
6 | Number of gear shifts | 6
7 | Total time spent with different engine temperature | 8
8 | Brake and retarder data | 9
9 | Total time spent in different ambient pressure | 12
10 | Total distance driven with different loads | 32
11 | Total distance driven in incline/decline over time | 32

To ensure that data was reported at least monthly, each operational feature was queried across several vehicles. One insight was the high number of zero values in features represented as histograms with multiple intervals, particularly in the outer intervals. To address this, intervals were merged (see Table 3.3) to reduce the number of zero values and improve computational efficiency. Additionally, the three operational features listed in rows 6, 7, and 9 of Table 3.2 were removed due to a high proportion of missing values, as explained in Section 3.3.1.

Table 3.3: The eight final operational features and the number of columns per feature after merging intervals.

Operational feature | #Columns
Total distance driven | 1
Total engine running time | 1
Total time in power take-off mode | 1
Total time running in top gear | 1
Total engine fuel consumption | 1
Brake and retarder data | 3
Total distance driven with different loads | 7
Total distance driven in incline/decline over time | 8

The second source of data consists of fault codes, which are logged when errors occur in engine components. We selected fault codes that were related to the ECU and had occurred in vehicles with ECU failures. This resulted in 59 features and 59 columns. By combining the operational data and the fault code data, the final dataset contained 67 features across a total of 82 columns.

3.3 Data Preprocessing Steps

Data preprocessing is essential for turning raw data into high-quality input, meaning data that is clean, consistent, and relevant for machine learning models [34]. This follows the well-known concept in computer science: Garbage In, Garbage Out. In other words, if the input data is poor, the model's results will also be poor, regardless of how advanced the models are. The following describes the data preprocessing pipeline used for preparing the datasets.
3.3.1 Data Cleaning and Aggregation

The first step in preprocessing involved identifying and removing duplicate records to avoid data redundancy, which could skew model training. Additionally, entries unrelated to the task were removed to ensure that only relevant data remained.

Next, to minimize the number of missing values in the operational data, a trial-and-error aggregation approach was used to determine the optimal granularity of the time dimension. The goal was to reduce the percentage of missing values while still keeping enough points in the time dimension. Weekly, biweekly, and monthly intervals were explored. Monthly aggregation gave the best results and was used. If multiple readings existed within an interval, their mean value was used, see Figure 3.1. We considered the mean to be a reasonable choice, given that the data for each feature increased approximately linearly with time. After aggregation, most features had 1-13% missing values, while some had around 50% or more. MICE has been shown to perform optimally when the proportion of missing values is at or below approximately 50% [35]. Since one feature had 53% missing values, we made this the threshold, and features with more than 53% missing values were removed.

Figure 3.1: An example of the monthly aggregation method with mean replacement.

3.3.2 Overview of Data Classes

A total of 147 vehicles experienced an ECU failure within three years of the truck's assembly date, compared to hundreds of thousands of vehicles that did not experience an ECU failure. All trucks in the dataset have been in operation for at most five years. Based on this, the first approach was to divide the data into three classes of similar time frames. Since 99% of the failed vehicles had failed within three years, this was chosen as our upper class cutoff.

• Class 0: 3+ years until failure
• Class 1: 1-3 years until failure
• Class 2: 0-1 years until failure

Two types of datasets were created for the models: one tabular dataset for the classical machine learning models and one time-sequential dataset for the time-sequential models. The time-sequential dataset consisted of one sequence of data for each vehicle, and the label for each sequence was determined by the class of the last data point in the sequence. This is illustrated in Table 3.4, which presents a sequence for one vehicle. The sequence is labeled 2 because the class of the final reading is 2. As a result of this class assignment method, all healthy vehicles were assigned the label 0 and all failed vehicles received the label 2. This created an imbalance in the dataset, as there were no instances of label 1, which was a significant issue.

Table 3.4: An example sequence for one vehicle, which gets label 2 since the final reading has class 2.

Timesteps until failure | 40 | 30 | 20 | 10 | 0
Class | 0 | 1 | 1 | 2 | 2

To solve this issue, we tried to create new sequences with label 1 by cropping the sequences of the failed vehicles. However, this resulted in very short sequences, which made the models perform poorly. Instead, we decided to use two classes instead of three.

• Class 0: 3+ years until failure
• Class 1: 0-3 years until failure

When assigning healthy vehicles to Class 0, it was decided to exclude data points from the last three years. This was to ensure that the vehicle operates for at least another three years after the last considered data point, so it can safely be assigned Class 0. It was also required that the vehicles had been in operation for at least five years to provide sufficient data.
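A sketch of this labelling rule is given below. It assumes each data point can be summarized by its reading date together with the vehicle's assembly date, last observed date, and optional failure date; all names are hypothetical, and the actual implementation in this thesis may differ.

import pandas as pd

THREE_YEARS = pd.Timedelta(days=3 * 365)
FIVE_YEARS = pd.Timedelta(days=5 * 365)

def assign_class(reading_date, assembly_date, last_observed_date, failure_date=None):
    """Return 1 (fails within three years), 0 (healthy), or None (excluded)."""
    if failure_date is not None:
        return 1  # failed vehicle: the ECU failed within three years of assembly
    in_operation = last_observed_date - assembly_date
    margin = last_observed_date - reading_date
    if in_operation >= FIVE_YEARS and margin >= THREE_YEARS:
        return 0  # known healthy for at least three more years after this data point
    return None   # excluded: health over the following three years is unknown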
3.3.3 Dataset Overview

Two datasets were created from the same data. The first was a tabular dataset consisting of 9683 rows, used for the baseline models. The class distribution of the tabular dataset between healthy and failed vehicles is shown in Figure 3.2. As seen in the figure, there is a four-to-one class imbalance.

Figure 3.2: Class distribution of the tabular dataset between the classes: healthy and failed vehicles.

For the transformer encoder models, a sequential dataset consisting of 432 sequences was used. The class distribution of the sequential dataset is shown in Figure 3.3. As seen in the figure, there is a two-to-one class imbalance.

Figure 3.3: Sequence distribution between the two classes: healthy and failed vehicles.

A key consideration with time series data is that the length of sequences can vary a lot, as shown in Figure 3.4. This variation exists not only between the two classes but also within each class. Failed vehicles have shorter sequences because they fail earlier, while the variation among healthy vehicles is due to differences in their production dates.

Figure 3.4: The length of the sequences and their frequency for the two classes: healthy and failed vehicles.

To make the sequences compatible with the transformer encoder model, each one was padded at the beginning with the value -1e6 to reach a uniform length of 36. This value was chosen instead of the more common zero because zeroes can appear in the rescaled data. Since the data is centered around zero, -1e6 provides a safe padding value that is clearly outside the normal data range. These padded values are used only to match the required input shape. To ensure they do not affect the model, a padding mask is applied. This mask assigns large negative scores to the padded positions before the softmax function in Equation 2.5 is applied. The padding length of 36 is slightly above the maximum sequence length of 35 months, since it creates a three-year window. If the model is used on sequences longer than 36 months, the most recent 36 timesteps are kept. The dataset was split into two parts, 80% for training and 20% for testing, while keeping the original class ratio in both sets.

3.3.4 Handling Missing Values with MICE

MICE was used to fill in missing values because it can handle complex dependencies between variables and performs well even when a high proportion of values is missing [24]. Being able to handle a higher amount of missing data was important, since much of the available data contained a fair amount of missing values. During cross-validation, an imputer was created for each fold by fitting it on the training data, and then used to impute missing values in both the training and validation sets. After cross-validation, a new imputer was fitted on the entire training set and used to impute missing values in both the full training set and the test set.

An important aspect of using a MICE imputer is selecting an appropriate model strategy for filling in missing values. Since the dataset consisted of continuous variables, a linear regression model was chosen, as it is the standard and often the most effective option for this type of data [35].
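Scikit-learn provides a MICE-style imputer as IterativeImputer, which is still marked experimental and therefore requires an explicit enabling import; its default estimator, BayesianRidge, is a regularized linear regression, in line with the model strategy above. The sketch below mirrors the fit/transform split described in this section on toy data; the exact setup used in this thesis may differ in detail.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with missing entries, for illustration only.
X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])
X_test = np.array([[2.0, np.nan]])

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)                      # fit on the training data only
X_train_imp = imputer.transform(X_train)  # impute the training set ...
X_test_imp = imputer.transform(X_test)    # ... and the test set with the same imputer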
3.3.5 Rescaling Data with Robust Scaler

In the paper by Amorim et al. [36], multiple scalers were compared across several classification tasks. They found that the Quantile Transformer performed best, but it can distort the linear correlations between variables, which would negatively affect the interpretability of the relationships between features. Furthermore, the inverse transformation of the Quantile Transformer does not guarantee recovery of the original values. Instead, we chose to rescale the data with the Robust Scaler, since it was the best performing scaler that kept the linear relations between features and provided a reliable inverse transformation. This property is valuable for interpreting the model's behavior in relation to the original features.

The Robust Scaler (see Equation 3.1) scales each feature instance x_i by first subtracting the median of the feature, x̄, and then dividing by the interquartile range (IQR). The IQR is the difference between the 75th percentile (x_75) and the 25th percentile (x_25) of the feature. Scaling the features with the IQR makes this method less sensitive to outliers compared to other methods, such as min-max normalization.

x_{scaled} = \frac{x_i - \bar{x}}{x_{75} - x_{25}}    (3.1)

3.4 Selecting Baseline Models Using Lazy Predict

To provide a comparison for the transformer encoder model, four baseline models were selected. Three of these were identified using the Lazy Predict library [33], which evaluated 26 traditional classification models on the dataset. Since the dataset was already imputed and scaled, the corresponding preprocessing steps were removed from the built-in Lazy Predict pipeline. The top three performing models in terms of accuracy were the Extra Trees Classifier, LGBM Classifier, and Random Forest Classifier, as shown in Table 3.5. A fourth baseline model, Logistic Regression, was included because it is commonly used as a reference and differs from the ensemble-based methods. All baseline models from the scikit-learn and LightGBM libraries were trained using the default parameters, without any hyperparameter tuning.

Table 3.5: Performance of the top 10 traditional classifiers based on accuracy using Lazy Predict.

Model | Accuracy | Balanced Accuracy | F1 Score | Time (s)
Extra Trees Classifier | 0.829 | 0.576 | 0.786 | 0.632
LGBM Classifier | 0.828 | 0.574 | 0.268 | 0.784
Random Forest Classifier | 0.823 | 0.547 | 0.767 | 1.581
Ada Boost Classifier | 0.821 | 0.529 | 0.754 | 1.090
XGB Classifier | 0.820 | 0.571 | 0.780 | 0.202
Linear Discriminant Analysis | 0.815 | 0.538 | 0.759 | 0.079
Calibrated Classifier CV | 0.813 | 0.500 | 0.730 | 0.146
Dummy Classifier | 0.813 | 0.500 | 0.730 | 0.017
Support Vector Classifier | 0.813 | 0.500 | 0.730 | 1.113
Ridge Classifier | 0.813 | 0.514 | 0.742 | 0.027

3.5 Architecture and Hyperparameter Search for Transformer Model

We decided to use fully learnable positional encodings, as they outperformed sinusoidal encodings on various classification tasks [11]. To find a transformer encoder architecture that works well with the dataset, we performed a two-stage hyperparameter search, starting with a broad random search followed by a local exhaustive grid search. We used 5-fold cross-validation in both phases to reduce variance and assess generalization. Each fold was trained for up to 200 epochs using early stopping with a patience of 10 epochs. This setup aimed to allow models to converge while maintaining computational efficiency. The training was conducted using the Adam optimizer with weight decay and a fixed learning rate. We empirically tested batch sizes of 16, 32, 64, and 128. This test concluded that a batch size of 64 offered the best trade-off between training speed and model stability.

We began the hyperparameter search with a random search over a parameter space inspired by previous transformer-based regression and classification models [5] [10] [11]. The search spanned 6912 total combinations across model and training parameters, summarized in Table 3.6. For the feed-forward network within each encoder layer, we followed the approach in [10] by expanding its dimensionality to four times the hidden dimension.

Table 3.6: Hyperparameter search space for random search with a total of 6912 combinations.

Parameter Name | Parameter Type | Search Space
Number of heads | Model | 1, 2, 4, 8
Hidden dimensions | Model | 128, 256, 512
Number of layers | Model | 1, 2, 3, 4, 5, 6
Activation function | Model | GELU, ReLU
Dropout | Model | 0.2, 0.3, 0.4, 0.5
Learning rate | Training | 1e-4, 1e-5, 1e-6
Weight decay | Training | 0, 1e-4, 1e-5, 1e-6
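One iteration of the random search can be drawn from the space in Table 3.6 as shown below. The function cross_validate_transformer is a hypothetical placeholder for 5-fold cross-validation with early stopping as described above; the sketch illustrates the sampling loop rather than the thesis's actual code.

import random

search_space = {
    "num_heads": [1, 2, 4, 8],
    "hidden_dim": [128, 256, 512],
    "num_layers": [1, 2, 3, 4, 5, 6],
    "activation": ["gelu", "relu"],
    "dropout": [0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-4, 1e-5, 1e-6],
    "weight_decay": [0, 1e-4, 1e-5, 1e-6],
}

best_config, best_score = None, float("-inf")
for _ in range(1000):  # 1000 random samples, as used in this work
    # Draw one configuration uniformly at random from the search space.
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = cross_validate_transformer(config)  # hypothetical training routine
    if score > best_score:
        best_config, best_score = config, score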
3.3.5 Rescaling Data with Robust Scaler

In the paper by Amorim et al. [36], multiple scalers were compared across several classification tasks. They found that the Quantile Transformer performed best, but it can distort the linear correlations between variables, which would hurt the interpretability of the relationships between features. Furthermore, the inverse transformation of the Quantile Transformer does not guarantee recovery of the original values. We therefore chose to rescale the data with the Robust Scaler, since it was the best-performing scaler that preserved the linear relations between features and provided a reliable inverse transformation. This property is valuable for interpreting the model's behavior in relation to the original features.

Figure 3.4: The length of the sequences and their frequency for the two classes: healthy and failed vehicles.

The Robust Scaler (see Equation 3.1) scales each feature instance $x_i$ by first subtracting the median of the feature, $\bar{x}$, and then dividing by the interquartile range (IQR). The IQR is the difference between the 75th percentile ($x_{75}$) and the 25th percentile ($x_{25}$) of the feature. Scaling the features with the IQR makes this method less sensitive to outliers than methods such as min-max normalization.

\[ x_{\text{scaled}} = \frac{x_i - \bar{x}}{x_{75} - x_{25}} \tag{3.1} \]
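A minimal sketch of this scaling with scikit-learn's RobustScaler, fitted on the training data only and reused elsewhere, mirroring the imputation procedure above (array names are illustrative):

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # subtracts the median, divides by the IQR per feature
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)

# A reliable inverse transformation recovers the original feature values,
# which supports interpreting the model in terms of the original features
X_original = scaler.inverse_transform(X_train_scaled)
```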
3.4 Selecting Baseline Models Using Lazy Predict

To provide a comparison for the transformer encoder model, four baseline models were selected. Three of these were identified using the Lazy Predict library [33], which evaluated 26 traditional classification models on the dataset. Since the dataset was already imputed and scaled, the corresponding preprocessing steps were removed from the built-in Lazy Predict pipeline. The top three models in terms of accuracy were the Extra Trees Classifier, the LGBM Classifier, and the Random Forest Classifier, as shown in Table 3.5. A fourth baseline model, Logistic Regression, was included because it is commonly used as a reference and differs from the ensemble-based methods. All baseline models from the scikit-learn and LightGBM libraries were trained with default parameters, without any hyperparameter tuning.

| Model | Accuracy | Balanced Accuracy | F1 Score | Time (s) |
| Extra Trees Classifier | 0.829 | 0.576 | 0.786 | 0.632 |
| LGBM Classifier | 0.828 | 0.574 | 0.268 | 0.784 |
| Random Forest Classifier | 0.823 | 0.547 | 0.767 | 1.581 |
| Ada Boost Classifier | 0.821 | 0.529 | 0.754 | 1.090 |
| XGB Classifier | 0.820 | 0.571 | 0.780 | 0.202 |
| Linear Discriminant Analysis | 0.815 | 0.538 | 0.759 | 0.079 |
| Calibrated Classifier CV | 0.813 | 0.500 | 0.730 | 0.146 |
| Dummy Classifier | 0.813 | 0.500 | 0.730 | 0.017 |
| Support Vector Classifier | 0.813 | 0.500 | 0.730 | 1.113 |
| Ridge Classifier | 0.813 | 0.514 | 0.742 | 0.027 |

Table 3.5: Performance of the top 10 traditional classifiers based on accuracy, using Lazy Predict.
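A sketch of this screening step with the lazypredict library is shown below. The split is illustrative, and the removal of the library's built-in imputation and scaling steps mentioned above is not shown here; the code uses the library's standard API.

```python
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

# X, y: the already imputed and scaled tabular dataset (assumed given)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, _ = clf.fit(X_train, X_test, y_train, y_test)
print(models.head(10))  # classifiers ranked by performance, as in Table 3.5
```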
3.5 Architecture and Hyperparameter Search for Transformer Model

We decided to use fully learnable positional encodings, as they outperformed sinusoidal encodings on various classification tasks [11]. To find a transformer encoder architecture that works well with the dataset, we performed a two-stage hyperparameter search: a broad random search followed by a local exhaustive grid search. We used 5-fold cross-validation in both phases to reduce variance and assess generalization. Each fold was trained for up to 200 epochs using early stopping with a patience of 10 epochs. This setup was intended to let models converge while remaining computationally efficient. Training used the Adam optimizer with weight decay and a fixed learning rate. We empirically tested batch sizes of 16, 32, 64, and 128; a batch size of 64 offered the best trade-off between training speed and model stability.

We began the hyperparameter search with a random search over a parameter space inspired by previous transformer-based regression and classification models [5], [10], [11]. The search spanned 6912 total combinations across model and training parameters, summarized in Table 3.6. For the feedforward network within each encoder layer, we followed the approach in [10] and expanded its dimensionality to four times the hidden dimension.

| Parameter Name | Parameter Type | Search Space |
| Number of heads | Model | 1, 2, 4, 8 |
| Hidden dimensions | Model | 128, 256, 512 |
| Number of layers | Model | 1, 2, 3, 4, 5, 6 |
| Activation function | Model | GELU, ReLU |
| Dropout | Model | 0.2, 0.3, 0.4, 0.5 |
| Learning rate | Training | 1e-4, 1e-5, 1e-6 |
| Weight decay | Training | 0, 1e-4, 1e-5, 1e-6 |

Table 3.6: Hyperparameter search space for the random search, with a total of 6912 combinations.

After 1000 random samples, the best configuration found is shown in Table 3.7. This configuration served as the starting point for the local exhaustive grid search.

| Parameter Name | Value |
| Number of heads | 1 |
| Hidden dimensions | 128 |
| Number of layers | 5 |
| Activation function | GELU |
| Dropout | 0.4 |
| Learning rate | 1e-4 |
| Weight decay | 1e-6 |

Table 3.7: Best random search hyperparameter configuration after 1000 iterations.

To further improve the results, we defined a local search space centered on the best random configuration. The dropout was kept fixed at 0.4 to reduce the number of combinations in the exhaustive search. The local exhaustive grid search covered the 243 combinations shown in Table 3.8.

| Parameter Name | Parameter Type | Search Space |
| Number of heads | Model | 1, 2, 4 |
| Hidden dimensions | Model | 64, 128, 256 |
| Number of layers | Model | 4, 5, 6 |
| Activation function | Model | GELU, ReLU |
| Dropout | Model | 0.4 |
| Learning rate | Training | 3e-4, 1e-4, 3e-5 |
| Weight decay | Training | 1e-5, 1e-6, 1e-7 |

Table 3.8: Hyperparameter search space for the local exhaustive search, with a total of 243 combinations.

The best configuration from the local grid search is listed in Table 3.9. Compared to the random search configuration, it favors a slightly larger hidden dimension and replaces the GELU activation function with ReLU.

| Parameter Name | Value |
| Number of heads | 1 |
| Hidden dimensions | 256 |
| Number of layers | 4 |
| Activation function | ReLU |
| Dropout | 0.4 |
| Learning rate | 3e-4 |
| Weight decay | 1e-7 |

Table 3.9: Best local exhaustive search hyperparameter configuration.

These optimized hyperparameters were used to construct the final transformer encoder architecture visualized in Figure 3.5. The model uses a single attention head and four encoder layers, each with a hidden dimensionality of 256. This configuration results in approximately 2.4 million trainable parameters.

Figure 3.5: Final transformer encoder architecture based on parameters from the exhaustive search.
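The random-search stage can be sketched as follows; cross_validate_encoder is a hypothetical helper standing in for building the encoder from a configuration and running the 5-fold training described above.

```python
import random

SEARCH_SPACE = {  # from Table 3.6
    "n_heads": [1, 2, 4, 8],
    "hidden_dim": [128, 256, 512],
    "n_layers": [1, 2, 3, 4, 5, 6],
    "activation": ["gelu", "relu"],
    "dropout": [0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-4, 1e-5, 1e-6],
    "weight_decay": [0, 1e-4, 1e-5, 1e-6],
}

def cross_validate_encoder(config: dict, n_folds: int = 5) -> float:
    """Placeholder: build the encoder from config, train each of the n_folds
    folds for up to 200 epochs with early stopping (patience 10), and return
    the mean validation score."""
    raise NotImplementedError

rng = random.Random(0)
best_config, best_score = None, float("-inf")
for _ in range(1000):  # 1000 random samples, as in the search above
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    score = cross_validate_encoder(config, n_folds=5)
    if score > best_score:
        best_config, best_score = config, score
```

The subsequent grid stage replaces the random sampling with an exhaustive loop over the narrower space in Table 3.8.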
3.6 Model Evaluation

All models, including the transformer encoder and the selected baseline classifiers, are evaluated using accuracy, balanced accuracy, weighted F1 score, and a custom cost metric (see Equation 3.2). In addition, two naive classifiers that always predict the same class are included as benchmarks. The non-ensemble transformer encoder models were each trained and evaluated five times independently, and their metrics were averaged to obtain more stable performance estimates.

\[ \text{Total Cost} = \frac{1}{n} \left( 1 \times \mathrm{FP} + 10 \times \mathrm{FN} \right) \tag{3.2} \]

The custom cost function was developed, inspired by the approach in [1], to account for the different costs of incorrect predictions. In their case there were five classes, with misclassification costs ranging from 7 to 500 depending on the severity of the error. In the binary classification task addressed in this thesis, the model may either over-predict or under-predict the RUL of the ECU. An over-prediction, which corresponds to a False Negative (FN) as shown in Table 3.10, occurs when the model estimates that the ECU will last longer than its actual lifespan. This can result in unexpected component failure, which can lead to vehicle breakdowns, towing, delivery delays, and possible spoilage of transported goods. An under-prediction is equivalent to a False Positive (FP) and occurs when the model predicts that the ECU will fail before its true lifespan. This may lead to unnecessary replacement of ECUs, resulting in extra costs and resource waste. Although over-predictions are more costly, defining the exact cost ratio between the two types of errors is difficult. The rough estimate used in this work assumes that over-predictions are around ten times more costly than under-predictions (see Table 3.11).

| | Predicted Class 1 | Predicted Class 0 |
| Actual Class 1 | TP | FN |
| Actual Class 0 | FP | TN |

Table 3.10: Outline of the confusion matrix for binary classification.

| | Predicted Class 1 | Predicted Class 0 |
| Actual Class 1 | 0 | 10 |
| Actual Class 0 | 1 | 0 |

Table 3.11: Estimated cost ratio between FP and FN for the RUL of ECUs. The costs for TP and TN are zero.

Based on the cost estimates for FP and FN in Table 3.11, the custom cost function in Equation 3.2 is obtained. To ensure a fair comparison, the cost is normalized by the number of predictions, n. This normalization is necessary because the transformer model makes one prediction per sequence, while the baseline models make predictions on the individual data points within each sequence.
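Equation 3.2 and the cost matrix in Table 3.11 translate directly into a few lines of code; the function below is our sketch of the metric.

```python
import numpy as np

def custom_cost(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Normalized cost from Equation 3.2: an FP costs 1, an FN costs 10."""
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # under-prediction
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # over-prediction
    return (1 * fp + 10 * fn) / len(y_true)
```

As a check against the later results: applying this to the Logistic Regression confusion matrix in Table 4.2 (FP = 292, FN = 176, n = 1903) gives (292 + 1760)/1903, approximately 1.08, matching Table 4.1.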
4 Results

This section presents the results of the thesis, beginning with the performance of the baseline models, followed by the results of the transformer encoder models. The models are compared based on accuracy, balanced accuracy, weighted F1 score, and the custom cost metric. In addition, confusion matrices are included to give more detailed insight into where the models make incorrect predictions. Finally, for the ensemble models, plots are presented to visualize the level of vote agreement among the models within each ensemble.

4.1 Results for Baseline Models

Table 4.1 summarizes the performance of the baseline models, including simple naive classifiers. The models were evaluated using accuracy, balanced accuracy, weighted F1 score, and the custom cost metric. Two naive classifiers, which always predict a single class, serve as reference points.

| Model | Accuracy | Balanced Accuracy | F1 Score | Custom Cost |
| Logistic Regression | 0.75 | 0.66 | 0.77 | 1.08 |
| Random Forest | 0.82 | 0.55 | 0.77 | 1.68 |
| Extra Trees | 0.83 | 0.57 | 0.78 | 1.58 |
| LGBM | 0.83 | 0.57 | 0.78 | 1.57 |
| Naive positive | 0.20 | 0.50 | 0.31 | 0.81 |
| Naive negative | 0.80 | 0.50 | 0 | 1.87 |

Table 4.1: Results of the baseline models, including naive classifiers that predict only healthy or only failed.

To further assess the classification performance of the baseline models, confusion matrices were computed for each. Tables 4.2 through 4.5 show the distribution of true and false predictions for each model, broken down by actual and predicted classes.

| | Pred 1 | Pred 0 |
| Actual 1 | 179 | 176 |
| Actual 0 | 292 | 1256 |

Table 4.2: Confusion matrix for the Logistic Regression model.

| | Pred 1 | Pred 0 |
| Actual 1 | 20 | 335 |
| Actual 0 | 13 | 1535 |

Table 4.3: Confusion matrix for the Random Forest model.

| | Pred 1 | Pred 0 |
| Actual 1 | 56 | 299 |
| Actual 0 | 25 | 1523 |

Table 4.4: Confusion matrix for the Extra Trees model.

| | Pred 1 | Pred 0 |
| Actual 1 | 60 | 295 |
| Actual 0 | 33 | 1515 |

Table 4.5: Confusion matrix for the LGBM model.

4.2 Results for Transformer Models

Table 4.6 shows the performance of the transformer encoder models, comparing the two hyperparameter search methods as well as ensemble and non-ensemble versions. The Parameter column indicates whether the model's parameters were selected using the random or the exhaustive search. The same evaluation metrics as for the baseline models are used: accuracy, balanced accuracy, weighted F1 score, and the custom cost. Two naive classifiers applied to the time series data are included as baselines to highlight how simple strategies compare to the transformer models.

| Model | Ensemble | Parameter | Accuracy | Balanced Accuracy | F1 Score | Custom Cost |
| Transformer Encoder | No | Random search | 0.80 | 0.78 | 0.80 | 1.05 |
| Transformer Encoder | Yes | Random search | 0.85 | 0.84 | 0.79 | 0.77 |
| Transformer Encoder | No | Exhaustive search | 0.80 | 0.79 | 0.80 | 0.97 |
| Transformer Encoder | Yes | Exhaustive search | 0.83 | 0.81 | 0.75 | 0.90 |
| Naive Positive | - | - | 0.34 | 0.50 | 0.18 | 0.66 |
| Naive Negative | - | - | 0.66 | 0.50 | 0.52 | 3.44 |

Table 4.6: Results of the transformer encoder models and the naive classifiers on the test set.

Similar to the baseline models, confusion matrices were also computed for the transformer models. Table 4.7 presents the average confusion matrix for the non-ensemble model with random search parameters, averaged over five runs, and Table 4.8 the corresponding matrix for the non-ensemble model with exhaustive search parameters. The confusion matrix for the ensemble model with random search parameters is shown in Table 4.9, and the one for the ensemble model with exhaustive search parameters in Table 4.10.

| | Pred 1 | Pred 0 |
| Actual 1 | 21.8 | 8.2 |
| Actual 0 | 9.4 | 47.6 |

Table 4.7: Average confusion matrix over 5 runs for the non-ensemble transformer model with random search parameters.

| | Pred 1 | Pred 0 |
| Actual 1 | 22.6 | 7.4 |
| Actual 0 | 10 | 47 |

Table 4.8: Average confusion matrix over 5 runs for the non-ensemble transformer model with exhaustive search parameters.

| | Pred 1 | Pred 0 |
| Actual 1 | 24 | 6 |
| Actual 0 | 7 | 50 |

Table 4.9: Confusion matrix for the ensemble transformer model with random search parameters.

| | Pred 1 | Pred 0 |
| Actual 1 | 23 | 7 |
| Actual 0 | 8 | 49 |

Table 4.10: Confusion matrix for the ensemble transformer model with exhaustive search parameters.

To further investigate the behavior of the ensemble models, Figures 4.1 and 4.2 show the number of correct and incorrect predictions based on how many individual models agreed on the outcome. Figure 4.1 illustrates this for the ensemble model with random search parameters, while Figure 4.2 presents the same analysis for the exhaustive search variant. In both cases, prediction accuracy increases as the number of agreeing models increases.

Figure 4.1: Correct and incorrect predictions made by the ensemble model from the random search, grouped by the number of agreeing models (three, four, or five).

Figure 4.2: Correct and incorrect predictions made by the ensemble model from the exhaustive search, grouped by the number of agreeing models (three, four, or five).
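The grouping used in these figures can be reproduced from the individual model predictions, as in the following sketch; a hard-voting ensemble of five members is assumed, and the function name is ours.

```python
import numpy as np

def agreement_breakdown(member_preds: np.ndarray, y_true: np.ndarray) -> dict:
    """member_preds: (5, n_samples) binary predictions of the ensemble members.

    Returns, for each level of agreement (3, 4, or 5 models), the number of
    correct and incorrect majority-vote predictions.
    """
    votes = member_preds.sum(axis=0)         # models voting for class 1
    majority = (votes >= 3).astype(int)      # hard majority vote
    agreeing = np.maximum(votes, 5 - votes)  # majority size; no ties with 5 voters
    breakdown = {}
    for k in (3, 4, 5):
        group = agreeing == k
        correct = int((majority[group] == y_true[group]).sum())
        breakdown[k] = {"correct": correct, "incorrect": int(group.sum()) - correct}
    return breakdown
```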
5 Discussion

In this final chapter, we discuss the performance of the models based on the results presented in the previous chapter and address the three subgoals of the thesis. Finally, we present directions for future work and conclude the thesis.

5.1 Comparison of Model Performance

The first subgoal, SG1, was to evaluate and compare the transformer encoder model against traditional machine learning models. Before comparing the results, it is important to note that even though the data is the same for the tabular (see Table 3.2) and sequential (see Table 3.3) datasets, the tabular dataset has a larger class imbalance. This is because the sequences of the minority class are shorter, which worsens the imbalance.

In Table 4.1, we observe that among the baseline models, logistic regression had the lowest overall accuracy at 0.75, falling below the naive negative classifier. However, it achieved the highest balanced accuracy at 0.66 and produced only 176 false negatives (see Table 4.2), far fewer than any other baseline model, resulting in a custom cost of 1.08, the lowest among the baseline models.

In contrast, the Random Forest model achieved an accuracy of 0.82, higher than logistic regression. However, as seen in Table 4.3, it performed poorly on the positive cases, producing 335 false negatives compared to 176 for logistic regression. Since false negatives are the most costly type of incorrect prediction, this resulted in the highest custom cost of any baseline model, 1.87.

Extra Trees and LGBM produced very similar results across all metrics. Both achieved an accuracy of 0.83, the highest of the baseline models, and had an identical balanced accuracy of 0.57 and weighted F1 score of 0.78. The only difference was in the custom cost, where LGBM had a slightly lower value of 1.57 compared to 1.58 for Extra Trees, owing to a lower number of false negatives, as shown in Tables 4.4 and 4.5.

The transformer encoder models' results, shown in Table 4.6, were better than those of the baseline models across all standard metrics. However, on the custom cost metric, the naive positive classifier obtained a lower cost than any of the transformer encoder models. Taken at face value, this would suggest that the best strategy for handling used ECUs is to always discard them instead of reinstalling them. For a model to be useful in the real world, it needs to beat the naive classifiers on this metric, which none of our models achieve. One reason the models do not reach a lower cost is that they are trained to maximize accuracy rather than to minimize the custom cost. It should also be stressed that the cost ratio between false positives and false negatives is only a rough estimate, and more work is needed to make it more accurate.

To address subgoal SG2, we aimed to optimize model performance by tuning the hyperparameters of the transformer encoder model, using both a random search and a local exhaustive grid search. The random search let us explore a wide range of parameter combinations at a lower computational cost and was expected to provide a solid starting point. The exhaustive search was then applied to locally optimize this result and, ideally, improve performance further.

However, as shown in Table 4.6, there was no improvement in either accuracy or weighted F1 score for the single transformer model from the exhaustive search compared to the random search model; only small improvements were seen in balanced accuracy and the custom cost. The confusion matrices in Tables 4.7 and 4.8 show that the models make similar predictions, although the random search model produces more over-predictions and fewer under-predictions than the exhaustive search model. Since over-predictions are ten times more costly than under-predictions, this accounts for the small difference in custom cost.

When comparing the two ensemble models, the one from the random search actually performed better than the one from the exhaustive search. From the confusion matrices in Tables 4.9 and 4.10, it is clear that the random search model performs better in terms of both under- and over-predictions. This was unexpected, since the exhaustive search model had performed better during cross-validation. We believe this may be due to the small size of the dataset, which might have caused the exhaustive search to overfit the training data. We tried to counteract this with techniques such as ensemble learning and early stopping, but some overfitting may still have occurred.

The third subgoal, SG3, was to explore whether ensemble methods can improve the performance of the transformer encoder model. The results in Table 4.6 show that the ensemble models performed better than the single models for both the random and the exhaustive parameter search solutions. Looking at the ensemble voting results in Figures 4.1 and 4.2, we can see that all five models in the ensemble agree on the final prediction in most cases. A large portion of these unanimous predictions are correct, which suggests that the models have a shared understanding of the data. The improved performance of the ensemble compared to a single model stems from the instances where only three or four models agree on the prediction. In these cases, because of the way the data was split during cross-validation, different models can correctly identify different parts of the more difficult examples. This contributes to the overall better performance of the ensemble model compared to any single model, for both the transformer encoder trained with random search parameters and the one trained with exhaustive search parameters.

5.2 Future Work

Continuing to collect data on failed ECUs and increasing the size of the dataset will likely improve the performance of the models. A larger dataset can help the models generalize better and make more accurate predictions. It would also be interesting to explore additional operational features or truck-related data that could provide the models with more information.

Another potential improvement is to experiment with sinusoidal positional encodings in the transformer encoder model (a sketch of such an encoding is given at the end of this section). Although this thesis used learnable positional encodings, which were recommended for classification tasks in the study by Zerveas et al. [11], the original transformer model by Vaswani et al. [10] used sinusoidal encodings. The performance of different positional encoding techniques may also depend on the specific dataset, making this an area for further research.

Including alternative optimization techniques, such as the AdamW optimizer or Stochastic Gradient Descent, in the hyperparameter search could also improve performance. Additionally, combining these optimizers with learning rate schedulers may lead to faster convergence and more stable training.

Beyond exploring other hyperparameters, further improvements can come from more advanced hyperparameter search methods. Techniques such as Bayesian Optimization, which guides the search with a probabilistic model, or Genetic Algorithms, which evolve parameter settings over time, could lead to stronger models.

Another direction for future research is to apply other sequential models, such as Long Short-Term Memory networks, Gated Recurrent Units, and plain Recurrent Neural Networks, to this dataset. This would help put the performance of the transformer encoder used in this work into perspective.
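For reference, a sketch of the fixed sinusoidal encoding from Vaswani et al. [10] that such an experiment would use, in place of learned parameters (an even model dimension is assumed):

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed encodings from Vaswani et al. [10]:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    Assumes an even d_model.
    """
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe  # added to the input embeddings instead of being learned
```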
5.3 Conclusion

In this thesis, several transformer encoder models were developed and trained to classify the RUL of ECUs. These models were evaluated and compared against four traditional machine learning models and two naive classifiers. The transformer models outperformed the traditional machine learning models in terms of accuracy, balanced accuracy, weighted F1 score, and the custom cost metric. Additionally, the ensemble transformer models performed better than the single transformer models. However, the transformer models performed worse than a naive classifier when evaluated using the custom cost metric. Further research on this dataset is therefore needed before the models can be applied in real-world operations. This work serves as a starting point for improving decision-making in ECU refurbishment, especially when assessing the reuse of ECUs in Volvo trucks.

Bibliography

[1] T. Lindgren, O. Steinert, O. Andersson Reyna, Z. Kharazian, and S. Magnússon, SCANIA Component X Dataset: A Real-World Multivariate Time Series Dataset for Predictive Maintenance (Version 2), Dataset, 2024. doi: 10.5878/jvb5-d390.
[2] G. Sateesh Babu, P. Zhao, and X.-L. Li, "Deep convolutional neural network based regression approach for estimation of remaining useful life," in Database Systems for Advanced Applications, S. B. Navathe, W. Wu, S. Shekhar, X. Du, X. S. Wang, and H. Xiong, Eds., Cham: Springer International Publishing, 2016, pp. 214–228, isbn: 978-3-319-32025-0.
[3] X. Li, Q. Ding, and J.-Q. Sun, "Remaining useful life estimation in prognostics using deep convolution neural networks," Reliability Engineering & System Safety, vol. 172, pp. 1–11, 2018, issn: 0951-8320. doi: 10.1016/j.ress.2017.11.021.
[4] D. Chen, W. Hong, and X. Zhou, "Transformer network for remaining useful life prediction of lithium-ion batteries," IEEE Access, vol. 10, pp. 19621–19628, 2022. doi: 10.1109/ACCESS.2022.3151975.
[5] O. Ogunfowora and H. Najjaran, A transformer-based framework for multivariate time series: A remaining useful life prediction use case, 2023. doi: 10.48550/arXiv.2308.09884.
[6] C. Ferreira and G. Gonçalves, "Remaining useful life prediction and challenges: A literature review on the use of machine learning methods," Journal of Manufacturing Systems, vol. 63, pp. 550–562, 2022, issn: 0278-6125. doi: 10.1016/j.jmsy.2022.05.010.
[7] F. Ahmadzadeh and J. Lundberg, "Remaining useful life estimation: Review," International Journal of System Assurance Engineering and Management, vol. 5, no. 4, pp. 461–474, 2014, issn: 0976-4348. doi: 10.1007/s13198-013-0195-0.
[8] D. K. Frederick, J. A. DeCastro, and J. S. Litt, "User's guide for the commercial modular aero-propulsion system simulation (C-MAPSS)," Tech. Rep., 2007.
[9] Y. Cheng, J. Wu, H. Zhu, S. W. Or, and X. Shao, "Remaining useful life prognosis based on ensemble long short-term memory neural network," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–12, 2021. doi: 10.1109/TIM.2020.3031113.
[10] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[11] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff, A transformer-based framework for multivariate time series representation learning, 2020. doi: 10.48550/arXiv.2010.02803.
[12] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5–32, 2001.
[13] L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Routledge, 2017.
[14] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees," Machine Learning, vol. 63, pp. 3–42, 2006.
[15] G. Ke, Q. Meng, T. Finley, et al., "LightGBM: A highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, et al., Eds., vol. 30, Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
[17] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington, Eds., ser. Proceedings of Machine Learning Research, vol. 9, Chia Laguna Resort, Sardinia, Italy: PMLR, 2010, pp. 249–256. [Online]. Available: https://proceedings.mlr.press/v9/glorot10a.html.
[18] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, 2012. [Online]. Available: https://arxiv.org/abs/1207.0580.
[19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, Jan. 2014, issn: 1532-4435.
[20] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org.
[21] A. Krogh and J. Hertz, "A simple weight decay can improve generalization," in Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, Eds., vol. 4, Morgan-Kaufmann, 1991. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/1991/file/8eefcfdf5990e441f0fb6f3fad709e21-Paper.pdf.
[22] L. Prechelt, "Early stopping - but when?" in Neural Networks: Tricks of the Trade, G. B. Orr and K.-R. Müller, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 55–69, isbn: 978-3-540-49430-0. doi: 10.1007/3-540-49430-8_3.
[23] O. Sagi and L. Rokach, "Ensemble learning: A survey," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4, Jul. 2018, issn: 1942-4787. doi: 10.1002/widm.1249.
[24] M. J. Azur, E. A. Stuart, C. Frangakis, and P. J. Leaf, "Multiple imputation by chained equations: What is it and how does it work?" International Journal of Methods in Psychiatric Research, vol. 20, no. 1, pp. 40–49, 2011.
[25] pyodbc, 2008. [Online]. Available: https://pypi.org/project/pyodbc/.
[26] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, 2001–. [Online]. Available: http://www.scipy.org/.
[27] C. R. Harris, K. J. Millman, S. J. van der Walt, et al., "Array programming with NumPy," Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020. doi: 10.1038/s41586-020-2649-2.
[28] W. McKinney et al., "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference, Austin, TX, vol. 445, 2010, pp. 51–56.
[29] J. D. Hunter, "Matplotlib: A 2D graphics environment," Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007. doi: 10.1109/MCSE.2007.55.
[30] M. L. Waskom, "seaborn: Statistical data visualization," Journal of Open Source Software, vol. 6, no. 60, p. 3021, 2021. doi: 10.21105/joss.03021.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[32] A. Paszke, S. Gross, F. Massa, et al., PyTorch: An imperative style, high-performance deep learning library, 2019. arXiv: 1912.01703 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1912.01703.
[33] S. Pandala, Lazy Predict, accessed 2025-04-17, 2021. [Online]. Available: https://github.com/shankarpandala/lazypredict.
[34] A. Tawakuli, B. Havers, V. Gulisano, D. Kaiser, and T. Engel, "Survey: Time-series data preprocessing: A survey and an empirical analysis," Journal of Engineering Research, 2024, issn: 2307-1877. doi: 10.1016/j.jer.2024.02.018.
[35] J. N. Wulff and L. Ejlskov, "Multiple imputation by chained equations in praxis: Guidelines and review," Electronic Journal of Business Research Methods, vol. 15, no. 1, pp. 41–56, 2017.
[36] L. B. de Amorim, G. D. Cavalcanti, and R. M. Cruz, "The choice of scaling technique matters for classification performance," Applied Soft Computing, vol. 133, p. 109924, 2023, issn: 1568-4946. doi: 10.1016/j.asoc.2022.109924.