Using Machine Learning for Accelerating the Detection Time of Unreliable Failure Detectors Master’s thesis in Computer science and engineering CHENGBO REN Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2024 Master’s thesis 2024 Using Machine Learning for Accelerating the Detection Time of Unreliable Failure Detectors CHENGBO REN Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2024 Using Machine Learning for Accelerating the Detection Time of Unreliable Failure Detectors CHENGBO REN © CHENGBO REN, 2024. Supervisor: Elad Michael Schiller, Department of Computer Science and Engineering Advisor: Hasan Mahmood, Nexer Group Examiner: Romaric Duvignau, Department of Computer Science and Engineering Master’s Thesis 2024 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX Gothenburg, Sweden 2024 iv Using Machine Learning for Accelerating the Detection Time of Unreliable Failure Detectors CHENGBO REN Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract This study evaluates Machine Learning-based Failure Detectors for detecting faulty nodes in distributed systems, focusing on metrics like Detection Time and Average Mistake Rate. Random Forest and Support Vector Machine Failure Detectors emerge as top performers. However, Machine Learning-based Failure Detectors exhibit higher error rates, especially with longer detection times. To address this, three ensemble methods are proposed, with one integrating top Machine Learning- based Failure Detectors and analytical Failure Detectors for better balance. The Precision Distribution module improves accuracy but struggles with completeness. One analytical Failure Detector is used as a fallback to address completeness concerns, although its effectiveness depends on the relationship between the analytical Failure Detector and Precision Distribution module thresholds. Additionally, the Li-Marin Long Short-Term Memory Failure Detector shows reduced detection time but with increased computational overhead, posing challenges in fine-tuning overestimation levels. Keywords: Failure Detector, Machine Learning, Distributed Systems, Ensemble Methods. v Acknowledgements On finishing the last word of my thesis report, I see a heart filled with gratitude beating in my chest. I first need to thank my supervisor, Elad Michael Schiller, who guided me through the whole project with great patience and wisdom on not only academic research but also how to be a strong and honest man. His passion and devotion to his work have set a standard for me to reach regardless of my career choice in the future. I am so grateful to have him as my supervisor. I extend my sincere appreciation to my company advisor, Hasan Mahmood. Hasan’s practical insights, expert advice, and willingness to share his knowledge have played a pivotal role in enriching the relevance and applicability of my research within the corporate context. His support and encouragement have provided me with invaluable perspectives and guidance, enabling me to navigate the complexities of integrating academic theory with real-world practices effectively. To my beloved wife and daughters, I owe an immeasurable debt of gratitude for their unwavering love, encouragement, and support throughout my thesis journey. Their constant belief in my abilities, unwavering patience during challenging times, and endless encouragement have been the cornerstone of my academic achievements. Whether it was offering words of wisdom, lending a listening ear, or providing practical assistance, my family’s steadfast support has been the driving force behind my perseverance and success. I am profoundly grateful for their sacrifices, understanding, and unwavering belief in my dreams. Chengbo Ren, Gothenburg, March 2024 vii Contents List of Acronyms xi List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Motivation and Related Work . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Contribution of This Work . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Disposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Methods 5 2.1 Crash Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.1 Unreliable Failure Detectors . . . . . . . . . . . . . . . . . . . 6 2.2.2 Eventually Perfect Failure Detectors . . . . . . . . . . . . . . 6 2.2.3 Heartbeat Failure Detector . . . . . . . . . . . . . . . . . . . . 6 2.3 Analytical Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 BDBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.2 TTF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.4.1 Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4.2 LSTM for Regression . . . . . . . . . . . . . . . . . . . . . . . 9 2.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.6 Data Points Generation . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.6.1 Parameter Module . . . . . . . . . . . . . . . . . . . . . . . . 11 2.6.2 Timed and Time-free Windows . . . . . . . . . . . . . . . . . 11 2.6.3 Generating Data Points . . . . . . . . . . . . . . . . . . . . . 12 2.7 Precision-Distribution Module . . . . . . . . . . . . . . . . . . . . . . 12 2.8 Data Leakage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3 Implementation 15 3.1 The LSTM-based Unreliable Failure Detector by Li and Marin . . . . 15 3.1.1 Li-Marin Solution Overview . . . . . . . . . . . . . . . . . . . 15 ix Contents 3.1.2 Our Implementation with Simulative Approaches . . . . . . . 16 3.1.3 Advantages and Disadvantages Analysis . . . . . . . . . . . . 17 3.2 ML-based Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.1 ML Models Construction . . . . . . . . . . . . . . . . . . . . . 17 3.2.2 ML-based FDs with Analytical Fallback . . . . . . . . . . . . 18 3.3 Ensemble Failure Detectors . . . . . . . . . . . . . . . . . . . . . . . 18 4 Evaluation 19 4.1 Evaluation Environment . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.2.1 Classification Metrics . . . . . . . . . . . . . . . . . . . . . . . 20 4.2.2 Time and Accuracy Metrics . . . . . . . . . . . . . . . . . . . 22 4.3 Experiment Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3.1 Experiment suites . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3.2 Experiment thresholds . . . . . . . . . . . . . . . . . . . . . . 23 4.3.3 Experiment configurations . . . . . . . . . . . . . . . . . . . . 24 4.4 Clustering and Ranking ML-based FDs . . . . . . . . . . . . . . . . . 25 4.5 Li-Marin LSTM FD Evaluation . . . . . . . . . . . . . . . . . . . . . 25 5 Results 27 5.1 ML-based FDs without PD Module . . . . . . . . . . . . . . . . . . . 27 5.2 ML-based FDs with PD Module and Fallback . . . . . . . . . . . . . 28 5.3 Clustering ML-Based Binary Classifiers . . . . . . . . . . . . . . . . . 34 5.3.1 Crash Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.3.2 Crash-free Intervals . . . . . . . . . . . . . . . . . . . . . . . . 36 5.3.3 Ranking of ML-Based Binary Classifiers . . . . . . . . . . . . 36 5.3.3.1 Pairwise Comparison . . . . . . . . . . . . . . . . . . 36 5.4 Li-Marin LSTM FD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.5 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.5.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 6 Conclusion 41 Bibliography 43 A Additional Results I A.1 Clustering results of K-means . . . . . . . . . . . . . . . . . . . . . . I x List of Acronyms ANN Artificial Neural Network. 8, 17 AUC Area Under Curve. 17, 22, 28, 30, 38 BDBD Blanchard, Dolev, Beauquier, Delaët. xv, 3, 7, 18, 19, 23–25, 27, 28, 30–33, 38, 39, 41 FD Failure Detector. xiii, xv, 1–7, 9, 11, 12, 15, 17–20, 22–33, 36–39, 41 FDR False Discovery Rate. 21, 38 FLP Fischer-Lynch-Paterson. 2 FN False Negative. xiii, xv, 21, 25–29, 34, 35, 37, 38 FOR False Omission Rate. 21, 29, 38 FP False Positive. xiii, xv, 12, 21, 25–29, 36–38 LSTM Long Short-Term Memory. xiii, xv, 1–4, 8–10, 15–17, 19, 25–27, 36–38, 41 ML Machine Learning. xiii, 1–4, 17–20, 22–25, 27–31, 34–36, 38, 39, 41 PD Precision Distribution. xiii, xv, 5, 12, 18, 23–25, 27, 28, 30, 33–35, 38, 41 PR Precision Recall. 17, 21, 22 QoS Quality of Service. 22, 27 ROC Receiver Operating Characteristic. 17, 21, 22, 28, 30, 38 RTC Round-Trip Count. 11, 12 SVM Support Vector Machine. xv, 17, 19, 27–30, 32, 33, 36–38, 41 TN True Negative. 21, 29, 38 TP True Positive. 12, 21, 29, 35, 38 xi List of Acronyms TSH Time Since (last) Heard. 11, 12 TTF Timed and Time-Free. 7, 19, 23, 24 xii List of Figures 2.1 The architecture of the LSTM model. . . . . . . . . . . . . . . . . . . 9 2.2 The structure of the LSTM cell. . . . . . . . . . . . . . . . . . . . . . 10 2.3 The architecture of stacking. . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 A window W with a size of 20 samples, note that the system does not know about the samples outside of W. . . . . . . . . . . . . . . . . . 12 3.1 Monitored nodes keep sending heartbeats to the monitoring node. . . 15 3.2 Estimating the arrival time of the next heartbeat to suspect failures. . 16 3.3 A failure detector combines the ML-based FD with analytical fallback. 18 4.1 The chronological order of the “crash” and “non-crash” intervals. . . . 20 4.2 A confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.3 The places where FP/FN errors may occur in Li-Marin LSTM FD. . 26 5.1 Visualization of the FP and FN values for different ML classifiers. . . 29 5.2 Visualization of the classification metrics values for different ML classifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.3 The trade-off between TD and λM for the results of CP D without fallback. 34 5.4 The trade-off between TD and λM for the results of Cfallback,higher and CP D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.5 The trade-off between TD and λM for the results of Cfallback,equal and CP D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.6 The trade-off between TD and λM for the results of Cfallback,lower and CP D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 xiii List of Figures xiv List of Tables 4.1 Metrics derived from a confusion matrix. . . . . . . . . . . . . . . . . 21 4.2 Various values of thresholds . . . . . . . . . . . . . . . . . . . . . . . 24 5.1 The confusion matrix data results from evaluating the classifiers on the test set. Refer to Section 4.2.1 for the definition of these metrics. 29 5.2 The ratio x/X − 1 is evaluated for each FD, where x is the number of FP or FN, and X is the best FP, and respectively, FN value among all FDs. The best FP value is the SVM FD (marked with ∗), and the best FN value is the Neural Network FD (marked with ⋄). . . . . . . 29 5.3 The classification matrics derived from the confusion matrix data. . . 30 5.4 The ratio (1−x)/(1−X)−1 is evaluated for each classification metric, where X represents the best value within a specific column. Note that the best Fall-Out value corresponds to the smallest value, whereas for other metrics, the best value is the largest one. . . . . . . . . . . . . . 30 5.5 The specificity attained on the test set when employing a configuration with a threshold that yielded a specificity of t in the training set. . . 32 5.6 The detection time (TD) of each FD relative to the TD of the BDBD solution for attaining specific values of the average mistake rate (λM) in the configurations: CP D, Cfallback,higher, Cfallback,equal, and Cfallback,lower. AB, NN, RF, LR and DT are the abbreviations of Adaboost, Neural Network, Random Forest, Logistics Regression and Decision Tree respectively. . . . . . . . . . . . . . . . . . . . . . . . . 33 5.7 The pairwise comparison of classifiers Random Forest, SVM and Decision Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.8 The comparisons of classifiers SVM vs Adaboost and Random Forest vs Adaboost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.9 The confusion matrix data results of LSTM classifier and LSTM FD. 38 5.10 The PA and classification metrics results of LSTM classifier FD and the Li-Marin LSTM FD at lower and higher degree of over-estimation. 38 A.2 The clustering outcomes for seven classifiers. . . . . . . . . . . . . . . IV A.4 The clustering outcomes for seven classifiers. . . . . . . . . . . . . . . VIII xv List of Tables xvi 1 Introduction Failure detectors (FDs) play a crucial role in identifying faulty nodes in distributed computing models. While analytical FDs can achieve remarkable accuracy, they usually need long detection times. To address this issue of prolonged detection times associated with analytical FDs, people integrate machine learning (ML) methods to balance the accuracy and the detection time. This thesis project primarily concentrates on ML-based FDs. Section 1.1 provides an overview of fundamental concepts and challenges in distributed computing, as well as the significance of FDs. Following this, Section 1.2 and Section 1.3 briefly explain the previous work on FDs using ML classifiers and the Long Short-Term Memory (LSTM) model. We also describe our primary objective, which is to determine the suitable ML-based FDs for various scenarios. Our contributions are discussed in Section 1.4. Lastly, Section 1.5 outlines the structure of this entire thesis report. 1.1 Background A distributed system is made up of a collection of distributed computing units, often referred to as entities. In the context of this thesis, we use a process or a node to represent an entity. These entities are interconnected by a communication medium (usually channels), constituting an abstraction of a physically distributed system. Entities may have crash failures and Byzantine failures [1]. Intuitively, When an entity stops its execution prematurely, we say it encounters the crash failure. When an entity does not follow the behavior assigned to it, we say it encounters the Byzantine failure. We only focus on crash failure in this thesis. A channel can be synchronous or asynchronous and reliable or unreliable. Synchronous means that there is an upper bound on message transfer delays, while asynchronous means there is no such bound. Reliable means there is no message loss, creation, modification, or duplication, while unreliable means there is no such guarantee. Distributed computing focuses on a problem in terms of distributed entities with many parameters involved, but each entity has only a partial knowledge of these parameters. To solve such a problem, distributed entities exchange messages via the channels. The problem in the world of distributed computing has a set of properties such as liveness (sometimes called termination), safety (sometimes called validity, agreement), etc. The liveness property indicates the assurance that a problem yields a result, and the 1 1. Introduction safety properties establish safeguards ensuring that specific conditions must never be violated. We say that an algorithm solves a problem in the distributed computing model if, assuming the inputs are correct, any run of the algorithm in the distributed computing model satisfies all the properties of the problem. For example, consider the problem of reaching consensus [2] for several entities over the asynchronous and unreliable channels that have message loss problem. To simplify this problem, we assume that the involved entities do not have crash failure or Byzantine failure, and the messages sent over the channel might be lost, but no other errors. The safety properties of this problem state the correct outputs of an algorithm: all entities should agree, and in no cases two or more entities disagree. The liveness property states that all entities eventually and permanently enter into a done state. Whether an algorithm exists to solve problems within the distributed computing model, such as the consensus problem, has become a question of interest to many. Fischer, Lynch, and Paterson delved into the consensus problem in a distributed computing model, where entities can only have crash failures and the channels are both asynchronous and reliable. They posited the FLP impossibility in [3], saying that it is impossible to design a deterministic consensus algorithm that can simultaneously guarantee safety, liveness, and fault tolerance in an asynchronous system. One prominent approach to circumvent the FLP impossibility is the introduction of the failure detector (FD) concept, as proposed by Chandra and Toueg in [4]. Functionally, a failure detector can be seen as an oracle made up of several modules, with each module being associated with a process (an entity). These modules offer hints and insights into the respective processes regarding the failures of other processes within the system. Failure detectors (FDs) are categorized into different classes, each class dedicated to the particular type of failure information they provide. The choice of a particular class of failure detector is contingent on the characters of different problems within the context of a fault-prone asynchronous distributed system model. 1.2 Problem Statement Since Chandra and Toueg first formally defined unreliable failure detectors in [4], many analytical failure detectors that rely on analytical or mathematical models to make predictions have been developed. For instance, there is the ♢P failure detector that is based on Blanchard et al. [5], the Ω failure detector based on algorithms by Aguilera et. al [6], as well as Mostefaoui, Mourgaya, and Raynal [7], or their combination, as described at [8], Chapter 18.9. While analytical FDs can achieve impressive accuracy, they usually need long detection times. Consequently, people integrate machine learning (ML) into the implementation of FDs to balance between detection time and accuracy. Häger and Köre studied how to use ML classifiers to identify the fault entities [9]. In their work, Häger and Köre compared ML-based FDs with an analytical FD developed by Blanchard et al. [5]. Furthermore, Li and Marin [10] proposed a timeout-based heartbeat failure detector that utilizes the LSTM model [11]. The results of Häger and Köre show that although FDs based on different ML classifiers consistently achieve an accuracy rating exceeding 0.999, along with a 2 1. Introduction specificity surpassing 0.999, they exhibit varying levels of accuracy. The best classifier among all is the Random Forest, achieving an accuracy of 0.99999 and a specificity of 0.9999. Meanwhile, the Random Forest classifier has the ability to reduce detection time by as much as 86.3% compared to the analytical solution proposed by Blanchard et al. [5], while maintaining an error rate of 1 × 10−6. The result of Li and Marin also shows that the LSTM-based FD achieves much shorter detection times for a similar probability of availability than Chen’s analytical solution [12], but at a significant computation cost. In previous studies, the focus was mainly on comparing a single ML-based FD with a conventional analytical FD, without delving into a comprehensive comparison among various ML-based FDs. In this project, our main objective is to implement and evaluate FDs based on different machine-learning methods as proposed by Häger-Köre and Li-Marin. Our comparison is not just a comparison of different classification metrics values like specificity, accuracy, sensitivity, etc (refer to table 4.1). Instead, we also conduct a pair-wise comparison of different FDs on different data sets. On top of that, we construct ensemble FDs by combining top-performing FDs, aiming to find the most efficient FD tailored to different scenarios. 1.3 Motivation and Related Work Earlier for this project, Häger and Köre [9] studied eight different kinds of statistical classifiers that aim at facilitating the implementation of ♢P . Specifically, Häger and Köre modeled failure detection as a binary classification problem and implemented FDs that use Machine Learning (ML) as the mechanism for failure detection. They employed several binary classifiers in the implementation of failure detectors to distinguish between correct or faulty entities and compared their implementation with the analytical FD by Blanchard et al. [5] (refer to as BDBD in this thesis) in terms of accuracy and detection time. Li and Marin [10] used a machine learning method different from classification to implement FDs. Their solution is to let entities send heartbeats to each other regularly, use the LSTM model to predict the arrival time of a heartbeat, and compare it with the actual arrival time to infer the status of an entity. To our knowledge, the integration of ML methods into FD implementation is still in the initial stage, and comparisons of FDs based on various ML methods are even rarer. We hope our research will provide insights into selecting the most suitable ML-based FDs across diverse scenarios. 1.4 Contribution of This Work This project continues the work conducted by Häger and Köre [9] in the field of FDs. Building upon their work, our contributions are as follows: • Evaluate and compare the performance of FDs that are based on various ML classifiers. We address the data leakage issue in Häger and Köre’s solution and 3 1. Introduction filter out FDs with obviously poor performance. • Combine multiple machine learning models to create a combination of models for the FD that outperforms any individual model. • Cluster FDs based on different ML classifiers during crash and non-crash intervals, particularly when these FDs report false negatives and false positives, respectively. • Implement and evaluate an unreliable failure detector using LSTM [11]. We evaluate its accuracy (using the probability of availability as an indicator) and the computation time. 1.5 Disposition This thesis comprises six chapters, each with its distinct focus and content. The structure of the thesis is as follows: Chapter 1 - Introduction: In this chapter, we provide essential background informa- tion, outline the research’s aims and objectives, and present a list of our contributions. Chapter 2 - Methods: This chapter covers basic concepts related to Fault Detectors (FDs) and Machine Learning (ML) as well as the application of ML methods in FDs. Chapter 3 - Implementation: In this chapter, we provide comprehensive details on the implementation aspects, including specifics about the LSTM-based unreliable FD and the cluster and comparison algorithm. Chapter 4 - Evaluation: This chapter elucidates the experiments conducted and the configurations employed for evaluating FDs. Chapter 5 - Results: The outcomes of our research are presented and discussed in this chapter, providing insights into our findings. Chapter 6 - Conclusion: This final chapter encapsulates the discussions and conclu- sions drawn from our research, summarizing the key takeaways and contributions made. Each chapter contributes to the overall narrative of the thesis, enabling a compre- hensive understanding of the research and its outcomes. 4 2 Methods This chapter provides an overview of concepts and methods used by failure detectors, along with insights into data point generation and PD modules developed by Häger and Köre. In Section 2.1, we present the formal definition of crash failures. Section 2.2 introduces the concepts of failure detectors (FDs), explores their properties of completeness and accuracy, and gives a definition of eventually perfect failure detectors based on specific completeness and accuracy. Within this section, we also introduce the heartbeat failure detector, a mechanism for detecting the status of a process by continuously sending heartbeat messages. Section 2.3 briefs two analytical FDs belonging to class ♢P . Moving forward, Section 2.4 delves into the machine learning methods for both binary classification and regression tasks. Section 2.5 introduces three ensemble methods used to combine the results of multiple machine learning models. Next, the procedure of generating data points and the PD module are described in Section 2.6 and Section 2.7, respectively. Lastly, Section 2.8 introduces the concepts of data leakage. 2.1 Crash Failures From the perspective of a failure detector, a process (an entity) is considered to be in one of two states: if a process terminates prematurely, it is called as being crashed or faulty, otherwise, it is considered to be correct or non-faulty. Chandra and Toueg utilized a discrete global clock, represented by the set of natural numbers, in the definition of the failure pattern. This failure pattern describes the processes that have crashed at any time. Formally, let T be the clock’s ticks and let Π be the set of all processes in the system, then the failure pattern F is a function from T to 2Π, where F (t) denotes the set of crashed processes up to time t. We define crashed(F ) = ∪t∈T F (t) and correct(F ) = Π − crashed(F ). Since once a process crashes, it does not “recover", we have ∀t: F (t) ⊆ F (t + 1). 2.2 Failure Detectors Failure Detectors (FDs) are an abstraction in distributed computing that help systems identify faulty processes in a distributed environment. The FD in distributed systems 5 2. Methods is a module attached to a process that provides information about which processes in the system have crashed. At any given time, a FD can either trust a process to be functioning correctly or suspect it to have crashed. The classes of FDs are differentiated by their properties of completeness and accuracy [4]. Completeness refers to the eventual suspicion of every process that has actually crashed, while accuracy refers to the number of correct decisions made by the FD regarding crashed processes. 2.2.1 Unreliable Failure Detectors The unreliable failure detector abstraction was introduced by Chandra and Toueg in [4]. A failure detector is unreliable in the sense that it may either fail to identify a faulty process or falsely identify a correct process as faulty. The failure detector module maintains a list of processes that are suspected to be faulty, with this list being dynamically updated to reflect the addition or removal of processes as needed. However, it is noteworthy that the failure detection module can produce errors in its operation, such as mistakenly suspecting a running process as faulty or vice versa. 2.2.2 Eventually Perfect Failure Detectors According to Chandra and Toueg, there are two kinds of completeness and four accuracy properties are defined, resulting in eight distinct classes of failure detectors. We are more interested in the eventually perfect failure detectors (♢P ), which is defined as follows: Definition 2.2.1 (Eventually Perfect Failure Detectors) A failure detector pro- vides each process pi a local variable, a set denoted trusted(i). Let trusted(i, t) be the value of trusted(i) at time t. Remember that correct(F ) denotes the set of non- faulty processes in a run, and crashed(F ) denotes the set of faulty processes. With these denotations, the eventually perfect failure detector is defined by the following properties: • Completeness: ∀F , ∃t ∈ T , ∀i ∈ crashed(F ), ∀j ∈ correct(F ), ∀t ′ ≥ t: i /∈ trusted(j, t ′) • Accuracy: ∀F , ∃t ∈ T , ∀i, j ∈ Π − F (t′), ∀t ′ ≥ t: i ∈ trusted(j, t ′) Intuitively, completeness means that every faulty process is suspected by every correct process after a certain period of time, and accuracy means that no non-faulty process is suspected after a certain period of time since the last crash occurs. 2.2.3 Heartbeat Failure Detector A heartbeat failure detector is a type of failure detector in distributed systems that utilizes the heartbeat mechanism to determine the liveness of network nodes. The timeout-based heartbeat failure detector [13] is achieved by periodically sending heartbeat messages. These heartbeat messages are typically sent continuously from the time of the originator’s startup until its shutdown. If the monitoring process does 6 2. Methods not receive a heartbeat message within a specified number of intervals, it assumes that the process that should have sent the message has failed, or crashed. The time-free heartbeat failure detector [14] simply keeps track of the total number of heartbeat messages received from each counterpart process. The monitoring process outputs a vector of counters, one for each monitored process. In the event that a monitored process does not crash, its counter will continue to increase without bounds. Conversely, if a monitored process crashes, its counter will stop increasing. 2.3 Analytical Failure Detectors Häger and Köre [9] introduced two analytical FDs belonging to class ♢P : BDBD and TTF. 2.3.1 BDBD The BDBD FD, inspired by Blanchard et al. [5], is named after its authors: Blanchard, Dolev, Beauquier, and Delaët. This FD suspects a process by counting the number of round-trip messages. Suppose the process pi ∈ P uses BDBD FD to detect the status of the process pj ∈ P\{pi}, it records the number of completed round-trips between pi and other processes pk ∈ P\{pi, pj} since the last completed round-trip between pi and pj . If pj falls behind other processes a predefined number of round-trips, then it is suspected by pi. To use the BDBD FD, we compute the value of the predefined number of round-trips that is needed to reach a desired specificity level for the training set. A certain value of this specificity is a threshold, referred to as θ parameter. 2.3.2 TTF The TTF (Timed and Time-Free) FD extends the BDBD FD by introducing a timer that tracks the time elapsed since the last completed round-trip between pi and pj. That is to say, if pj falls behind other processes a predefined number of round-trips and the duration since the last completed round-trip between pi and pj exceeds a predefined threshold, pi considers pj as suspected. The TTF FD incorporates both round trips and a timer. To achieve a desired specificity level for the training data using TTF, we must establish appropriate values for both the round-trips and the timer. Similar to BDBD, a specific threshold value is defined for this specificity, referred to as the 2D probability threshold. 2.4 Machine Learning In the context of this project, we employ machine learning models to perform classification and regression. Classification aims to assign an input vector x to one of K discrete classes denoted as Ck, where k = 1, 2, . . . , K. Analogously, the purpose of regression is to predict one or more continuous target variables t based on the 7 2. Methods values of a D-dimensional input vector x. For classification, please refer to Section 2.4.1, which provides a list of the classifiers used in this project. Additionally, Section 2.4.2 gives a brief introduction to the application of the Long Short-Term Memory (LSTM) model [11] for regression. 2.4.1 Binary Classifiers For the purpose of our project, we employ a range of binary classifiers in the implementation of failure detectors to distinguish between correct or faulty nodes. The classifiers are trained using an algorithm that presents it with input data or “data points”, along with a label indicating what the data represents. We will study multiple off-the-shelf classifiers rather than focus on optimizing a few. The machine learning classifiers that are used in our work are: • Decision Tree: a classifier that reaches its decision through the use of a tree, which is strategically constructed upon training [15]. • Random Forest: a classifier that constructs multiple Decision Trees during training, and then makes its decision based on the trees’ majority vote [15]. • AdaBoost: a classifier, which is a meta-estimator, is employed by first fitting the classifier on the original data set and then fitting additional copies on the same data set, but with the weights of the misclassified instances adjusted. This process helps to direct subsequent classifiers to focus more on challenging cases [16]. • Logistic Regression: a classifier that involves fitting a logistic function to a data set through the use of Stochastic Gradient Descent [17], [18]. The output is bound between 0 and 1 and represents the predicted probability of a given data point belonging to one of the two predefined classes [15]. • Naive Bayes: a classifier that is based on the assumption of independence between the parameters of a data point. Using Bayes’s Theorem, the classifier determines the most probable class for a new data point, based on observations obtained during training [19]. • Support Vector Machine(SVM): a classifier that maps the input vectors of two classes, representing the data points, into a multi-dimensional space and creates a decision boundary plane to distinguish between the two classes [20]. • Artificial Neural Network(ANN): a classifier that utilizes the principles of a Neural Network, in which the input vector is processed through the network, a classification can be made by looking at the values of the neurons in the output layer [15]. • Long Short-Term Memory(LSTM): The utilization of an LSTM model as a binary classifier involves the implementation of a single neuron in the output layer, equipped with a sigmoid activation function. The sigmoid function transforms the input into a continuous value between 0 and 1, which can be interpreted as a probability of the positive class. 8 2. Methods Figure 2.1: The architecture of the LSTM model. • FedDyn: FedDyn model [21] can be trained on data that is distributed across multiple devices. In the context of binary classification, the FedDyn model can be configured with a single neuron in the output layer, equipped with a sigmoid activation function. This configuration is similar to that of the Long Short-Term Memory (LSTM) model, which has been previously proposed for binary classification tasks. 2.4.2 LSTM for Regression Now we are going to discuss the application of LSTM for regression. As showed in Figure 2.1, the LSTM model takes the sequential data x(i) as input, generates O(i) as the outputs, with i ∈ {0, 1, . . . , T}. C(i) and H(i) are the cell states and the hidden states respectively, and their purpose is to transfer information between step i − 1 and step i. The LSTM’s capability to easily capture the data patterns is mainly attributed to its special cell structure. Figure 2.2 shows the structure of the LSTM cell. This cell consists of three gates: the forget gate, the input gate, and the output gate. The forget gate determines whether information from the preceding step should be preserved or discarded as irrelevant. Consequently, only the relevant preceding data points in the sequential data set are kept, and the irrelevant preceding data points are discarded. Meanwhile, the input gate tries to learn new information from the input to the cell. Finally, the output gate conveys the updated information from the current step to the subsequent one (C(i) and H(i)) and generates the output (O(i) = H(i)). In our application of the LSTM model to the heartbeat FD in Section 2.2.3, we input a sequence of consecutive heartbeat message arrival times into the LSTM model. Then this model outputs the prediction of the arrival time of the next heartbeat message. To clarify, suppose we have a set of consecutive heartbeat messages mi, where i ∈ {1, 2, . . . , N}, each associated with the arrival time ti. After feeding these arrival times (ti) into the LSTM model, it outputs Oi, which are the predicted arrival times of the subsequent heartbeat messages (mi+1). 9 2. Methods Figure 2.2: The structure of the LSTM cell. 2.5 Ensemble Methods The ensemble method combines the predictions of multiple machine learning models to create a more robust and accurate model. By aggregating the predictions of multiple machine learning models, the idea is that the weaknesses of some models can be compensated by the strengths of other models. In this project, we will employ three ensemble methods. The first one is an unweighted majority voting algorithm, in which multiple machine learning models are trained independently on the same data set. When it predicts a new data point, each model gives its own classification prediction (vote). This algorithm counts the number of votes for each class and chooses the class with the highest count as the final predicted class. We call it ensemble. The second method is Stacking. This method is also known as a stacked generalization, a two-layer model. The first layer involves multiple models (base models) that are trained to make predictions, and the second layer is the meta-model, which is trained on the combined predictions of base models. Each base model may capture different aspects of the data set, and the meta-model can correct or weigh the predictions of individual base models, leading to an improved overall performance. Figure 2.3 shows the overall architecture of the stacking method. The third method is Blending, which is similar to stacking. The main difference is the training of the meta-model is applied on a separate holdout set instead of the full data set. Specifically, the training data set is partitioned into base models training and holdout sets. Base models are trained using the base models training set and subsequently make predictions on the holdout set. Then, the meta-model is trained on predictions from the holdout set. In general, the blending method is believed to be simpler and quicker than the stacking method. In the following description, we 10 2. Methods Figure 2.3: The architecture of stacking. call it BlendStacking. 2.6 Data Points Generation Häger and Köre [9] proposed a parameter module to derive parameters from the message-passing system log data. These parameters are categorized as timed and time- freed and extend the timed and time-freed windows. Subsequently, they developed a simulator to generate a set of data points from these timed and time-free windows to train the machine learning models. 2.6.1 Parameter Module The proposed parameter module represents an approach for transforming message- passing system data into parameters suitable for training machine learning models for failure detection. In addition to leveraging the parameter module, the machine learning-based failure detectors also incorporate an analytical solution as a fallback mechanism. This hybrid approach guarantees that in the event of a node crash, it will eventually be suspected, regardless of the output produced by the machine learning classifier. 2.6.2 Timed and Time-free Windows The training data set for the machine learning models are generated from timed and time-free parameters. The timed parameters are computed through the utilization of clocks or timers, such as Time Since last Heard (TSH). While time-free parameters, which are also known as round-trip-based parameters, are derived from the number of round trips completed in the system, without relying on clocks or timers. An example of a time-free parameter is the Round-Trip Count (RTC). This solution requires that the FD is queried at a fixed time interval. During each query, the FD captures and records the system state, resulting in a timed sample. A fixed number of consecutive samples form a timed window. Figure 2.4 gives an example of a timed window. Similarly, a time-free window includes a sequence of a certain number of complete round-trip counters. For each node pj ∈ P\{pi} that a node pi 11 2. Methods Figure 2.4: A window W with a size of 20 samples, note that the system does not know about the samples outside of W. is communicating with, the FD keeps a count of how many round-trips other nodes pk ∈ P\{pi, pj} have completed with pi since pj last completed a round-trip with pi. 2.6.3 Generating Data Points The data points used to train machine learning classifiers are generated from the values of the timed windows described in Section 2.6.2. In detail, the failure detector is queried at a fixed time interval, say, δ seconds, a data point is created based on the values of the timed windows every δ seconds. The population of the values of the windows is performed by iterating a log file in a chronological order, a method described in Section 2.6.2. 2.7 Precision-Distribution Module Precision-Distribution (PD) module was developed by Häger and Köre [9] to reduce the probability of falsely suspecting a non-faulty node. The PD module takes the form of a 2D-Histogram, which has two dimensions: RTC and TSH. This histogram is populated by having a classifier predict classes for the training set and recording frequencies of suspicions that were either FPs or TPs based on the RTC and TSH values associated with each bucket. Each bucket effectively represents precision (TP/(TP + FP )), signifying the probability of a suspicion being correct. Buckets with no TP or FP entries during training are conservatively assigned a precision value of 1.0. To use the PD module, a threshold (referred to as a histogram threshold) is set to determine when a raised suspicion is deemed valid; if the precision falls below this threshold, the suspicion is ignored. 2.8 Data Leakage The phenomenon of data leakage [22] occurs when information from sources outside of the training dataset is utilized in the training of a machine-learning model. It is a source for poor generalization and overestimation of expected performance, as this extraneous information can lead the model to acquire knowledge that it would not have in a real-world production setting, resulting in overly optimistic models or the invalidation of the model’s estimated performance. Häger and Köre [9] pointed out that a potential issue related to data leakage exists in the generation of training dataset for the machine learning classifiers. To address this 12 2. Methods issue, we plan to rectify the data leakage problem, regenerate the training dataset, conduct the experiments and collect the corresponding results. 13 2. Methods 14 3 Implementation We present the implementation in this chapter. In Section 3.1, we explained the Li-Marin LSTM-based FD and our implementation. Next, the implementation of FDs based on binary classifiers and ensemble methods are introduced in Section 3.2 and Section 3.3. 3.1 The LSTM-based Unreliable Failure Detector by Li and Marin Li and Marin [10] proposed a timeout-based heartbeat failure detector that utilizes the LSTM model [11] to predict the arrival time of next heartbeat message. 3.1.1 Li-Marin Solution Overview The solution is centered around the use of regular heartbeat messages transmitted by nodes in a distributed system. A group of nodes (monitored nodes) generate a heartbeat message h containing the sender ID and a sequence number and send it to a monitoring node every ∆ second. Upon reception of a heartbeat, the monitoring node records its arrival time in the log associated with the sender. This process is illustrated in Figure 3.1. To determine the status of monitored node p (trusted or suspicious), the monitoring Figure 3.1: Monitored nodes keep sending heartbeats to the monitoring node. 15 3. Implementation Figure 3.2: Estimating the arrival time of the next heartbeat to suspect failures. node q calculates an expected arrival time for the next heartbeat τ (referred to as the freshness point). If the heartbeat message is received by monitoring node q before the estimated time, then monitored node p is considered to be trusted. Otherwise, if the heartbeat message is not received by the estimated time, q suspects p of failure. This procedure is illustrated in Figure (3.2). With the objective of skewing the output of the failure detector towards over-estimations of the next arrival time, Li and Marin chose to employ a unilateral penalty loss function instead of the commonly used mean squared loss function. In their effort to achieve online predictions, Li and Marin limited the size of the training dataset, as the computational time of the Long Short-Term Memory (LSTM) model [11] needed to be smaller than the expected arrival time of the heartbeats. To fulfill this requirement, they fed the LSTM model with the nearest fixed size of data (500). Although this limited dataset size also meant that data that falls out of the range was discarded, its impact on the parameters of the LSTM model persists. Additionally, Li and Marin have incorporated a dynamic safety margin into their LSTM based failure detector. This safety margin allows the model to accommodate unexpected and variable delays, as it utilizes the ϵ most recent errors to adjust in time. The freshness point (τ) is calculated as per the following equation: τ ≈ LSTM.predict + 1 ϵ k∑ i=k−ϵ errori (3.1) 3.1.2 Our Implementation with Simulative Approaches We use PyTorch [23] to implement the LSTM model as described in Section 2.4.2. The input sequential data includes the arrival times of previous heartbeat messages, while the output predicts the arrival time of the next heartbeat message. In our implementation of the unilateral penalty loss function, we introduce a parameter to adjust the degree of overestimation, aiming to reach a balance between errors in identifying faulty nodes as non-faulty and vice versa. Unlike Li and Marin who assessed their approach on top of real traces: heartbeat transmission logs on PlanetLab.eu [24], we run and evaluate our solution on the data collected by Häger and Köre in [9]. Häger and Köre logged events and wrote them to files, and each event is attached with a timestamp of when it occurred. We choose one type of event that happened periodically and extract its timestamp for the simulation of the arrival time of a heartbeat message. 16 3. Implementation 3.1.3 Advantages and Disadvantages Analysis The comparison between the heartbeat failure detector proposed by Chen, Toueg, and Aguilera in [25] and the machine learning-based heartbeat failure detector reveals that the latter exhibits a significant improvement in terms of detection time while maintaining a comparable level of probability of availability. The probability of availability is defined as the ratio of safe predictions to the total number of predictions. However, it is important to recognize the limitations of the solution proposed by Li and Martin in [10]. Their approach, which is based on fixed interval heartbeat messages, may not be sufficiently robust due to potential network delays and the lack of guarantee of regular message transmission in multi-tasking environments. Although the model may function optimally when the network conditions are favorable and the sender process p is not hindered by other processes, it may not produce accurate results in less ideal circumstances. 3.2 ML-based Failure Detectors The ML-based FDs in this thesis refer to the FDs based on the binary classifiers in Section 2.4.1. 3.2.1 ML Models Construction We implemented our binary classifiers, including Decision Tree, Random Forest, AdaBoost, Logistic Regression, Naive Bayes, and SVM,utilizing the models provided in Scikit-Learn. We customized their hyperparameters, deviating from the default values. The other models ANN, LSTM and FedDyn were implemented using PyTorch models. To improve the stability, performance, and interpretability of classifier models, we used the StandardScaler from Scikit-Learn to standardize features in the generated data points, as detailed in Section 2.6, specifically for Scikit-Learn classifiers. Furthermore, we devised one classifier optimized for AUC PR (Precision-Recall) and another for AUC ROC (Receiver Operating Characteristic) for each Scikit-Learn classifier. Initially, we partitioned the training set into 80% for training and 20% for validation. Both optimized classifiers were then trained on the training subset and evaluated on the validation set. Subsequently, the evaluation results of the validation set were used to generate both PR and ROC curves. The decision threshold from the PR curve aimed to minimize the gap between precision and recall, whereas from the ROC curve, the threshold aimed to maximize the difference between sensitivity and fall-out. Finally, these thresholds were applied to the corresponding classifiers trained on the entire training set. 17 3. Implementation Figure 3.3: A failure detector combines the ML-based FD with analytical fallback. 3.2.2 ML-based FDs with Analytical Fallback Note that ML-based FDs cannot guarantee FD completeness. To address this problem, we use the analytical FDs that are of class ♢P as a fallback for all ML- based FDs, which means that suspicions made by ML-based FDs are overridden by the fallback. The BDBD FD falls under class ♢P , as it can ensure the properties of FD completeness and accuracy. Therefore, we use the BDBD solution as a fallback for all ML-based FDs. Additionally, The PD module can help ML-based FDs reduce the probability of falsely suspecting non-faulty nodes by using a threshold, preventing FDs from raising suspicions when the precision is low. With these considerations, our performance evaluation encompasses varying levels of mistake rates with PD enabled and BDBD FD used as a fallback. The structure of the ML-based FD with analytical fallback is shown in Figure 3.3. 3.3 Ensemble Failure Detectors As outlined in Section 2.5, we introduced three ensemble methods, leveraging the StackingClassifier provided by Scikit-Learn for their implementation. Notably, the outputs of these ensemble methods are contingent upon the outputs of ML-based FDs. Considering the influence of the PD module on the outputs of ML-based FDs, it is evident that the outputs of the ensemble methods are likewise affected by the PD module. Consequently, there is no need for a separate PD module specifically tailored for the ensemble methods. 18 4 Evaluation This chapter presents our empiric evaluation plan. Units Under Evaluation. We assess the ability of ML-based classifiers, including SVM, Naive Bayes,AdaBoost, Neural Network, Logistics Regression, Random Forest and Decision Tree, to solve the problem of failure detectors (FDs). We refer to these FDs as individual ML-based FDs. Additionally, We consider two non-ML-based FDs, named BDBD and TTF, as described in Section 2.3. We combine these individual FDs using ensemble techniques that are called unweighted majority voting, stacking, and blending (refer to Section 2.5). Chapter Organization. We start by describing the evaluation environment, which includes collected real data that is enriched using stimulative approaches. Then, we state the evaluation criteria, which is based on prevailing ML metrics. We describe different methods for examining the structural properties of the studied data and our ML-based FDs using clustering and ranking. Considering the unique nature of the LSTM FD by Li and Marin [10], an extra metric, named Probability of Availability, is used for its evaluation. 4.1 Evaluation Environment In alignment with Häger and Köre, we conducted our evaluation using simulative approaches, executing both “time cost” and “accuracy” suites in our experiment plan. As the nodes on PlanetLab.eu exhibit considerable stability and are unlikely to crash, we opt to simulate the crashes through the use of synthetic crash failures via the simulator developed by Häger and Köre. This simulator is used to “play back” the collected logs and create the data points employed in training and evaluating the proposed failure detectors. Given that these data points are produced through simulated rather than actual failures, this approach is referred to as simulative as it employs both collected and simulated data. In cases where the collected data contains failures, these data will be used; if no data contains failures, the simulated data will be employed. During the “play back” of the collected logs, any messages received from a node during a period in which the node is synthetically crashed (muted) are ignored by the broadcasting algorithm. Consequently, the broadcaster encounters a lack of 19 4. Evaluation Figure 4.1: The chronological order of the “crash” and “non-crash” intervals. response from the currently “crashed” node. Specifically, a crash interval refers to a specific time span during which the broadcaster encounters the absence of responses from the presumed “crashed” node. Conversely, a crash-free interval signifies a time span during which the broadcaster can reliably receive messages from the node as expected. Note that within the context of this study, these crash intervals and crash- free intervals occur in an alternating manner throughout the observation duration. Each crash or non-crash interval is called a period in the subsequent description. Figure 4.1 shows the chronological order of these intervals for a node pi. 4.2 Evaluation Criteria We evaluate the individual ML-based FDs and ensemble FDs from two perspectives. The first aspect involves examining the overall probability of raising a false suspicion, which we refer to Pr(FS). An FD with lower Pr(FS) is considered more favorable than one with higher Pr(FS). The second aspect is to investigate the relationship between Pr(FS) and the detection time for all FDs. An FD that achieves a faster detection time may involve implementing simpler algorithms or processing techniques that prioritize speed over thorough analysis. This could lead to a higher Pr(FS). Conversely, an FD that focuses on improving Pr(FS) may require more complex algorithms or extensive data analysis, which could increase the time needed for detection. This may result in a slower detection time as the detector thoroughly evaluates the data to minimize errors. A preferable FD is one that balances this trade-off, either by having shorter detection times while maintaining Pr(FS), or by decreasing Pr(FS) without incurring longer detection time. 4.2.1 Classification Metrics We employ classification metrics, which are described below, to evaluate the Pr(FS) of our individual ML-based detectors and the ensemble FDs. The machine learning classifiers used in our failure detectors will be evaluated based on metrics derived from a confusion matrix. A confusion matrix is a table that is used to define the performance of a classification solution, which is shown in Figure 4.2. The derived metrics from the confusion matrix include Accuracy, Sensitivity, Specificity, Precision, Fall-Out, False discovery rate, and False omission rate. The definitions of these metrics are in Table 4.1. Binary classifiers use the discrimination threshold to classify a data point. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. We can gain further insights into the performance 20 4. Evaluation Figure 4.2: A confusion matrix. Metric Name Derivation Description Accuracy T P +T N T P +F P +T N+F N The ratio of data points that were correctly classified. Sensitivity T P T P +F N The ratio of positive data points that were correctly classified. Specificity T N T N+F P The ratio of negative data points that were correctly classified. Precision T P T P +F P The ratio of positive classifications that turned out to be correct. Fall-Out F P F P +T N The ratio of negative data points that were incorrectly classified. Can also be written as 1-Specificity. False Dis- covery Rate (FDR) F P F P +T P The ratio of the number of false positive clas- sifications (false discoveries) to the total num- ber of positive classifications (rejections of the null). False Omis- sion Rate (FOR) F N F N+T N The ratio of actual positive instances that are incorrectly predicted as negative. Table 4.1: Metrics derived from a confusion matrix. of classifiers by combining the derived metrics in Table 4.1 and this discrimination threshold. Our study focuses on two methods for the present approach, specifically the analysis of the Receiver Operating Characteristic (ROC) curve [26], and the Precision-Recall (PR) curve [27]. We acknowledge that Recall is synonymous with Sensitivity. Both curves are generated by adjusting the threshold from 0 (predicting all as positive) to 1 (predicting all as negative). In the ROC curve, Sensitivity is plotted on the y-axis and Fall-Out on the x-axis. The PR curve, on the other hand, employs Precision on its y-axis, and Recall on its x-axis. Once these curves are established, a threshold can be selected from them in order to optimize for desired metrics. We signify that 21 4. Evaluation a classifier has been optimized in this way by appending its name with either PR or ROC. After the curves are constructed, one can determine the Area Under Curve (AUC), which ranges from 0 to 1. For AUC ROC, a higher score signifies that the classifier can efficiently differentiate between the two classes. The AUC PR gauges the classifier’s ability to accurately classify the positive data points without considering accurately classified negative data points. Nonetheless, we use only the AUC ROC as an evaluation metric, rather than the AUC PR, as it is more commonly employed. 4.2.2 Time and Accuracy Metrics Similar to the FD Quality of Service (QoS) metrics defined by Chen et al. [25], we establish the time and accuracy metrics for evaluating FDs. These metrics are derived from the FDs’ outputs. Specifically, the time metrics measure the duration during which the FD either trusts or suspects a node. The duration is measured using S- and T-transitions, where an S-transition signifies a transition from trust to suspicion, and a T-transition indicates a transition from suspicion to trust. Note that the First Detection Time (TD1) is introduced by Häger and Köre in [9]. The detailed time metrics is shown below: • Time Metrics – Detection Time (TD): The duration from when a node crashes to when it is permanently suspected, denoting the time until the last S-transition, after which no further T-transitions occur. – First Suspected Time (TD1): The duration from when a node crashes to when it is initially suspected, representing the time until the first S-transition takes place. The accuracy metrics indicate how effectively an FD avoids making mistakes. They are shown below: • Accuracy Metrics – Average Mistake Rate (λM): The probability of making a mistake per time unit, representing the average number of S-transitions per time unit. – Query Accuracy Probability (PQA): The probability that the FD’s output is correct at any given time. We use two main metrics, Detection Time (TD) and Average Mistake Rate (λM), to evaluate the performance of the ML-based FDs and investigate whether they can achieve a better balance between the detection time and the probability of raising a false suspicion. To clarify, we use TD to represent the detection time in the trade-off, as it indicates the convergence of the failure detectors’ suspicions, and λM to approximate Pr(FS), which represents the probability of raising a false suspicion. We choose λM as it assumes that making a mistake leads to a state transfer, and the duration of the error is irrelevant. In our system, each FD is queried at most once every 500ms, and we take this as the unit of time for λM . 22 4. Evaluation 4.3 Experiment Plan 4.3.1 Experiment suites We set two experiment suites to evaluate the time and accuracy metrics. The detection time experiment suite, denoted by ET , is conducted when one node, pe, fails, i.e.,, eventually crashes. In this case, TD is computed by measuring the period from the time of the crash to the last S-transition (Section 4.2.2), which is the point where no further T-transitions occur thereafter. The accuracy experiment suite, denoted by EA, is executed during periods where pe does not fail. To evaluate the accuracy metrics, the number of mistakes made by the FD during these periods is counted. Any suspicion made for pe in these periods is a mistake, and by definition, λM is the number of S-transitions per time unit. As mentioned in Section 4.2, there is a trade-off between detection time and accuracy for an FD. We explore the performance of FDs by considering this trade-off between the time and accuracy metrics, following the approach of Chen et al. [25] and Häger-Köre [9]. 4.3.2 Experiment thresholds We demonstrate the properties of the trade-off for each FD, by executing the experiment in ET and EA with three types of thresholds. These thresholds include the θ parameter for the BDBD solution (Section 2.3.1), the 2D probability threshold for the TTF solution (Section 2.3.2), and the histogram threshold for the PD module (Section 2.7). We run both EA and ET experiments for each set of thresholds in the assessment. This approach facilitates establishing a correlation between specific detection time and its corresponding accuracy. Our exploration is based on the assumption that increasing the values of thresholds will result in a decrease in the Average Mistake Rate (λM) and an increase in Detection Time (TD). Exploring a wide range of threshold values provides a deeper comprehension of the trade-off between the time and accuracy metrics for FDs. However, due to the time-consuming nature of running ET and EA experiments, we select a limited set of threshold values, which is shown in Table 4.2. These threshold values are chosen after analyzing the levels of Specificity that ML- based FDs achieve within the training set. For instance, to achieve a Specificity of five nines for the Random Forest FD with the PD module, the histogram threshold in the PD module (Section 2.7) should be set to 0.99. While achieving a Specificity of six nines, a threshold of 0.99998 is required. It’s important to note that the threshold value required for each Specificity level may vary across ML-based FDs, as it depends on the performance of each FD on the training set. 23 4. Evaluation 0.990 0.9990 0.99990 0.999990 0.9999990 0.991 0.9991 0.99991 0.999991 0.9999991 0.992 0.9992 0.99992 0.999992 0.9999992 0.993 0.9993 0.99993 0.999993 0.9999993 0.994 0.9994 0.99994 0.999994 0.9999994 0.995 0.9995 0.99995 0.999995 0.9999995 0.996 0.9996 0.99996 0.999996 0.9999996 0.997 0.9997 0.99997 0.999997 0.9999997 0.998 0.9998 0.99998 0.999998 0.9999998 Table 4.2: Various values of thresholds 4.3.3 Experiment configurations We run the experiment suites, ET and EA, with various threshold values under three experiment configurations: Canalytical, CP D and Cfallback. The Canalytical configuration is to establish a baseline for the time and accuracy metrics of analytical FDs (the BDBD FD and the TTF FD) at varying thresholds, offering a benchmark and a basis for comparison with ML-based FDs through the BDBD solution. The CP D configuration is to analyze the time and accuracy metrics for ML-based FDs with the PD module. We aim to decrease the probability of falsely suspecting a non- faulty node for ML-based FDs by leveraging the PD module. In this configuration, we expect the ML-based FDs to achieve a shorter detection time (TD) while maintaining the same level of average mistake rate (λM ) as analytical FDs. Besides, an increase in the PD module threshold is expected to lead to a decline in TD and an improvement in λM for FDs. The Cfallback configuration enables us to evaluate a solution wherein ML-based FDs not only incorporate the PD module but also integrate the BDBD solution as a fallback, ensuring the eventual detection of crashed nodes regardless of the ML classifier output. BDBD solution working as a fallback means that the suspicion made by BDBD will overrule the output of the ML-based FDs. This configuration includes three sub-configurations: Cfallback,higher, Cfallback,equal and Cfallback,lower based on whether the threshold value of BDBD is higher, equal to or lower than the threshold value of PD module. Specifically, when a PD module threshold is set to achieve a specificity of X number of 9s in the train set, for Cfallback,higher, Cfallback,equal and Cfallback,lower, the threshold of BDBD is set based on a specificity level of X + 1 number of 9s, X number of 9s and X − 1 number of 9s, respectively. Our expectation for Cfallback is that BDBD solution may have more influence on the final output in the Cfallback,lower sub-configuration as BDBD overrules the output of the ML-based FDs once its threshold is met. If the TD of BDBD is longer than that of ML-based FDs, as is likely, the BDBD solution’s impact on the final results will be negligible in Cfallback,higher and Cfallback,equal. However, if the TD of BDBD is shorter than that of ML-based FDs, the BDBD solution will impact the final results in all configurations of Cfallback, with Cfallback,lower being the one with the greatest impact. 24 4. Evaluation The permissible threshold values for the Canalytical and CP D configurations are the values listed in the second to fifth columns of Table 4.2. In each run of ET and EA within the Cfallback configuration, one threshold for BDBD and one threshold for the PD module are required. The permissible combinations of threshold values are the adjacent values in adjacent columns in Table 4.2. For example, in the sub-configuration Cfallback,higher, potential threshold values for the PD module and BDBD include {0.990, 0.9990}, {0.991, 0.9991} and so forth. While in the sub-configuration Cfallback,lower, the feasible threshold values for PD module and BDBD are {0.9990, 0.990}, {0.9991, 0.991}, and similar combinations. For the sub-configuration Cfallback,equal, the possible threshold values for PD module and BDBD are {0.9990, 0.9990}, {0.9991, 0.9991}, and onwards. 4.4 Clustering and Ranking ML-based FDs To investigate the structural patterns in the input/output behaviors of the ML-based FDs proposed by Häger and Köre in [9], the values of False Positive (FP) and False Negative (FN) of each ML classifier were compiled for each data point. Specifically, we use the results of FP/FN to cluster these ML classifiers using K-means algorithm [28] at the crash intervals where any ML classifiers report FN and the crash-free intervals where any ML classifier reports FP. We rank the ML-based FDs in terms of specificity and sensitivity to find the top- performing FDs. To facilitate further comparison and enable the selection of the best two or three FDs, a pairwise comparative analysis of the candidate FDs will be conducted using the following criteria. Let us consider the pairwise comparison between FDs A and B: • A & B: both FDs A and B make correct predictions. • ¬A & B: FD A makes an incorrect prediction while FD B makes a correct prediction. • A & ¬B: FD A makes a correct prediction while FD B makes an incorrect prediction. • ¬A & ¬B: both FDs A and B make incorrect predictions. By examining and comparing these distinct scenarios, we aim to determine the most reliable FDs. 4.5 Li-Marin LSTM FD Evaluation The Li-Marin LSTM FD [10] gives the status of a node by comparing the actual arrival time of the heartbeat message and the predicted time, which makes its evaluation different from the ML-based FDs by Häger and Köre. The Li-Marin LSTM FD may produce two error types: falsely identifying non-faulty nodes as faulty nodes (FP errors) and falsely identifying faulty nodes as non-faulty nodes (FN 25 4. Evaluation Figure 4.3: The places where FP/FN errors may occur in Li-Marin LSTM FD. errors). The places where these two types of errors may occur are shown in Figure 4.3. Recall that we want to control this FD in a manner that inclines it towards over- estimation of the anticipated arrival time of the heartbeat message. However, we should recognize that the extent of this overestimation impacts the occurrences of FP errors and FN errors. In the scenario where the overestimation is substantial, the FP errors diminish while the FN errors escalate. Conversely, if the overestimation is small, the FP errors increase while the FN errors decrease. Balancing these two types of errors is important for the effective functioning of the failure detector. To address this, we introduce a variable to control the extent of over-estimation. In addition to the metrics that we have listed previously, we introduce a new metric called Probability of Availability (PA) (following the approach of Li and Marin), which is the ratio of the safe predictions (actual arrival time is ahead of the predicted time) over the total predictions. This metric indicates the ability of this FD to avoid wrong suspects. 26 5 Results Our results show in Section 5.1 that classifiers such as Random Forest, SVM, and Logistic Regression outperform other classifiers in general. We also compare, in Section 5.2, the QoS metrics of ML-based FDs with PD module and fallback to the BDBD solution (Section 2.3.1). This compartment shows that Random Forest, SVM, and Logistic Regression can balance well the trade-off between detection time and accuracy, but only when aiming at short detection time, ca 1 second. When the detection time is more than 2 seconds is considered, BDBD and TTF can offer better accuracy. In section 5.3, we cluster the ML-based FDs at both the crash intervals where any ML- classifiers report FN and the data points where any FD reports FP. Based on these results, the FDs are ranked based on their specificity and sensitivity. Additionally, a pairwise comparison of the top-performing FDs is conducted toward the conclusion of this section. Further results of all clustering outputs for all FDs over various datasets are shown in Appendix A. Subsequently, Section 5.4 presents the evaluation of the LSTM-based heartbeat FD, focusing on the probability of availability (PA) and the classification metrics. Finally, we explain our results and discuss their limitations in Section 5.5. 5.1 ML-based FDs without PD Module We present the results of ML-based binary classification FDs without the use of the Precision Distribution (PD) module (refer to Section 2.7). Our results were collected after resolving the data leakage problem in Häger-Köre’s implementation. The confusion matrix results for the FDs over 15.8 × 106 data points are presented in Table 5.1, while Table 5.2 displays the ratios of all FDs concerning the best one, in terms of both false positives (FP) and false negatives (FN). Figure 5.1 provides an intuitive understanding of FP and FN errors across all ML-based FDs. For detailed derived metrics data and their respective ratios concerning all FDs relative to the best one, see Table 5.3 and Table 5.4. These results were collected for both individual ML-based FDs and ensemble FDs, namely Stacking, BlendStacking and Ensemble. Stacking FD incorporates SVM and Random Forest, leveraging SVM’s superior performance in minimizing FP errors and 27 5. Results Random Forest’s effectiveness in reducing FN errors. The aim is to capitalize on the strengths of both SVM and Random Forest within the Stacking FD. Additionally, BlendStacking FD integrates SVM, Random Forest, and analytical FDs: BDBD and TTF (Section 2.3). As per the findings in Section 5.2, while ML-based FDs demonstrate fewer prediction errors with detection time under two seconds, analytical FDs exhibit a lower average mistake rate with longer detection time. Through the integration of the outputs from both ML-based FDs and analytical FDs, we hope the BlendStacking FD can deliver robust performance across all detection time intervals. The Ensemble FD integrates all individual ML-based FDs. From the analysis of Figure 5.1, it shows that the FDs SVM and Naive Bayes tend to produce more FN errors. Conversely, the other FDs (AdaBoost, Neural Network, Logistics Regression, Random Forest and Decision Tree) tilt to more FP errors. Notably, SVM produces the minimum number of FP errors, followed by Random Forest and Logistics Regression. Although Neural Network produced the maximum number of FP errors, it has the fewest FN errors. Note that Neural Network, Naive Bayes and Decision Tree are the most polarized FDs concerning FP and FN errors whereas SVM and Random Forest can harmonize these types of errors. In summary, when considering both FP and FN errors, the individual ML-based FDs SVM and Random Forest are the best, whereas Naive Bayes ranks the least favorable classifier. The Stacking FD and Ensemble FD strike a good balance between FP and FN errors, while the performance of BlendStacking FD is not eye-catching. According to the data in Table 5.2, although the ensemble FDs, namely Stacking and BlendStacking, may not perform the best in terms of FP errors, they are not far from the best-performing classifier, specifically SVM. The individual ML-based FDs, such as Random Forest and Naive Bayes are also performing well. Moreover, The individual ML-based FDs, including Random Forest, AdaBoost and Decision Tree, along with the ensemble FD Stacking show strong performance in terms of FN errors, closely trailing the top performer (Neural Network). Figure 5.2 demonstrates the comprehensive comparison of classifier performance in terms of the various metrics encompassing accuracy, sensitivity, specificity, precision, and AUC ROC. In general, SVM, Random Forest and Logistics Regression perform well across all metrics. While AdaBoost and Decision Tree show notable performance across several metrics, their behaviors in terms of specificity and precision are not satisfying. Neural Network and Naive Bayes present good performance in one or two metrics, albeit manifesting poor results in others. 5.2 ML-based FDs with PD Module and Fallback We now show the performance of ML-based FDs across various levels of the average mistake rate (λM ) (Section 4.2.2) with PD module enabled and BDBD as the fallback. Note that as the Naive Bayes FD produces significantly more prediction errors than other FDs, we exclude it from subsequent evaluations. Following the footsteps of Häger and Köre [9], we optimized the training of ML-based FDs using the Precision- Recall (PR) curve, where Precision values on the y-axis and Recall values on the 28 5. Results ML-Classifiers FP TP FDR FN TN FOR AdaBoost 4952 3683719 0.1342% 2990 12153543 0.0246% Neural Network 8225 3684027 0.2228% 2682 12150270 0.0221% Logistics Regression 4150 3680967 0.1126% 5742 12154345 0.0472% Naive Bayes 3217 3666098 0.0877% 20611 12155278 0.1693% Random Forest 2868 3683815 0.0778% 2894 12155627 0.0238% Decision Tree 6486 3683599 0.1758% 3110 12152009 0.0256% SVM 1630 3679244 0.0443% 7465 12156865 0.0614% Stacking 2288 3683619 0.0621% 3090 12156207 0.0254% BlendStacking 2463 3680042 0.0669% 6667 12156032 0.0548% Ensemble 2248 3682871 0.0610% 3838 12156247 0.0316% Table 5.1: The confusion matrix data results from evaluating the classifiers on the test set. Refer to Section 4.2.1 for the definition of these metrics. AB NN⋄ LR NB RF DT SVM∗ Stacking BlendStacking Ensemble FP Ratios 2.038 4.046 1.546 0.974 0.760 2.979 0 0.4037 0.5110 0.3791 FN Ratios 0.115 0 1.141 6.685 0.079 0.160 1.783 0.1521 1.4858 0.4310 Table 5.2: The ratio x/X − 1 is evaluated for each FD, where x is the number of FP or FN, and X is the best FP, and respectively, FN value among all FDs. The best FP value is the SVM FD (marked with ∗), and the best FN value is the Neural Network FD (marked with ⋄). AB NN LR NB RF DT SVM ST BSC ENS Machine Learning Classifiers 0 1000 2000 3000 4000 5000 6000 7000 8000 FP v al ue s FP 0 2500 5000 7500 10000 12500 15000 17500 20000 FN v al ue s AB: AdaBoost, NN: Neural Network, LR: Logistics Regression, NB: Naive Bayes, RF: Random Forest, DT: Decision Tree ST: Stacking, BSC: BlendStacking, ENS: Ensemble FN Figure 5.1: Visualization of the FP and FN values for different ML classifiers. 29 5. Results ML-Classifers Accuracy Sensitivity Specificity Precision Fall-Out AUC ROC AdaBoost 0.999499 0.999189 0.999593 0.99866 0.000407 0.999925358 Neural Network 0.999312 0.999273 0.999324 0.997772 0.000676 0.999298020 Logistics Regression 0.999376 0.998443 0.999659 0.998874 0.000341 0.999879967 Naive Bayes 0.998496 0.994409 0.999735 0.999123 0.000265 0.999045223 Random Forest 0.999636 0.999215 0.999764 0.999222 0.000236 0.999828069 Decision Tree 0.999394 0.999156 0.999467 0.998242 0.000533 0.999820163 SVM 0.999426 0.997975 0.999866 0.999557 0.000134 0.999876591 Stacking 0.999636 0.999215 0.999764 0.999557 0.000236 0.999603392 BlendStacking 0.999424 0.998192 0.999797 0.999331 0.000203 0.999632772 Ensemble 0.999616 0.998959 0.999815 0.999390 0.000185 Table 5.3: The classification matrics derived from the confusion matrix data. ML-Classifers Accuracy Ratio Sensitivity Ratio Specificity Ratio Precision Ratio Fall-Out Ratio AUC ROC Ratio AdaBoost 0.376374 0.115543 2.037313 2.024831 -0.000273 0 Neural Network 0.890110 0 4.044776 4.029345 -0.000542 8.404625 Logistics Regression 0.714286 1.141678 1.544776 1.541761 -0.000207 0.608116 Naive Bayes 3.131868 6.690509 0.977612 0.979684 -0.000131 11.791418 Random Forest 0 0.079780 0.761194 0.756208 -0.000102 1.303408 Decision Tree 0.664835 0.160935 2.977612 2.968397 -0.000399 1.409327 SVM 0.576923 1.785420 0 0 0 0.653345 Stacking 0 0.079780 0.761194 0 -0.000102 4.313470 BlendStacking 0.582418 1.486933 0.514925 0.510158 -0.000069 3.919857 Ensemble 0.054945 0.431912 0.380597 0.376975 -0.000051 Table 5.4: The ratio (1 − x)/(1 − X) − 1 is evaluated for each classification metric, where X represents the best value within a specific column. Note that the best Fall-Out value corresponds to the smallest value, whereas for other metrics, the best value is the largest one. x-axis. The threshold values that we used here are defined in Table 4.2. Please note that the thresholds were selected based on the level of Specificity the FDs achieved in the training set. We want to know the Specificity they attained in the test set. The Specificity of FDs at various thresholds are presented in Table 5.5. In Table 5.6, we provides the ratio of the detection time (TD) of both individual ML- based FDs and the ensemble FDs to that of BDBD across configurations, including CP D, Cfallback,higher, Cfallback,equal, and Cfallback,lower. For a comprehensive overview, Figure 5.3 shows the trade-off between TD and λM for the configuration of CP D. The evaluation results for the configuration Cfallback are shown in Figures 5.4, 5.5, and 5.6. These figures correspond to the results of Cfallback,higher, Cfallback,equal and Cfallback,lower, respectively. Table 5.6 shows that the TD for ML-based FDs is generally better than that for BDBD solution across various error rate levels when λM ≥ 1 × 10−5. However, for error rates below 1 × 10−5, some ML-based FDs show a higher detection time compared to BDBD (indicated by values greater than 1 in the corresponding cells). Note that lower λM requires longer TD, the data in the table implies that some ML-based FDs might perform worse than BDBD when we allow longer TD. 30 5. Results AB NN LR NB RF DT SVM ST BSC ENS Machine Learning Classifiers 0 2 4 6 8 10 12 Cl as sif ica tio n M et ric s R at io V al ue s AB: AdaBoost, NN: Neural Network, LR: Logistics Regression, NB: Naive Bayes, RF: Random Forest, DT: Decision Tree ST: Stacking, BSC: BlendStacking, ENS: Ensemble Accuracy Sensitivity Specificity Precision AUC_ROC Figure 5.2: Visualization of the classification metrics values for different ML classifiers. Figure 5.3 shows a similar trend from another perspective. The left-hand figure, a zoomed-in view, shows that when λM ≥ 10−5 and the detection time is less than 2 seconds, all individual ML-based FDs can achieve lower error rates than analytical FDs (TTF and BDBD). The right-hand figure (zoomed-out view) shows that individual ML-based FDs perform worse than analytical FDs when the detection time is longer than 5 seconds. In the context of individual ML-based FDs, it is obvious that Random Forest generally performs better than the other classifiers, and Nerual Network has the highest error rate but the shortest detection time. This is consistent with the results in Section 5.1. Furthermore, when looking at the ensemble FDs (Stacking, BlendStacking, and Ensemble), we see that they consistently outperform analytical FDs when the detec- tion time is under 2 seconds, with the Stacking FD consistently exhibiting superior performance across all configurations. However, as the detection time increases, the Stacking FD quickly becomes worse than analytical FDs, then the Ensemble FD also becomes worse than analytical FDs. Only BlendStacking FD is always better than analytical FDs. Despite variations in details across all configurations, Figures 5.4, 5.5, and 5.6 illustrate similar trends. 31 5. Results FD (config) Specificity AdaBoost PR (t = 0.999) 0.99978088 AdaBoost PR (t = 0.9999) 0.99998784 AdaBoost PR (t = 0.99999) 0.99999880 AdaBoost PR (t = 0.9999999) 0.99999880 Decision Tree PR (t = 0.999) 0.99952300 Decision Tree PR (t = 0.9999) 0.99995721 Decision Tree PR (t = 0.99999) 0.99999811 Decision Tree PR (t = 0.9999999) 0.99999811 Logistic Regression PR (t = 0.999) 0.99965037 Logistic Regression PR (t = 0.9999) 0.99999105 Logistic Regression PR (t = 0.99999) 0.99999905 Logistic Regression PR (t = 0.9999999) 0.99999905 NN (t = 0.999) 0.99935398 NN (t = 0.9999) 0.99998885 NN (t = 0.99999) 0.99999842 NN (t = 0.999999) 0.99999842 Random Forest PR (t = 0.999) 0.99976380 Random Forest PR (t = 0.9999) 0.99993937 Random Forest PR (t = 0.99999) 0.99999634 Random Forest PR (t = 0.999999) 0.99999710 FD (config) Specificity SVM PR (t = 0.999) 0.99970973 SVM PR (t = 0.9999) 0.99999351 SVM PR (t = 0.99999) 0.99999899 SVM PR (t = 0.999992) 0.99999899 TTF (t = 0.999) 0.99956485 TTF (t = 0.9999) 0.99999950 TTF (t = 0.99992) 1.0 BDBD (t = 0.999) 0.98968365 BDBD (t = 0.9999) 0.99983936 BDBD (t = 0.99999) 0.99999950 BDBD (t = 0.999992) 1.0 Ensemble (t = 0.999) 0.99982487 Ensemble (t = 0.9999) 0.99999338 Ensemble (t = 0.99999) 0.99999924 Ensemble (t = 0.999999) 0.99999924 Stacking (t = 0.999) 0.99980930 Stacking (t = 0.9999) 0.99995066 Stacking (t = 0.99999) 0.99999779 Stacking (t = 0.999999) 0.99999792 BlendStacking (t = 0.999) 0.99983703 BlendStacking (t = 0.9999) 0.99999950 BlendStacking (t = 0.99999) 1.0 Table 5.5: The specificity attained on the test set when employing a configuration with a threshold that yielded a specificity of t in the training set. 32 5. Results λ M C on fig ur at io n A B P R N N R F P R SV M P R D T P R LR P R St ac ki ng B le nd St ac ki ng E ns em bl e λ M ≤ 1 × 10 − 2 C P D 0. 56 0 0. 46 9 0. 48 4 0. 66 4 0. 53 4 0. 64 6 0. 48 4 0. 71 5 0. 59 4 C f a ll ba ck ,h ig h er 0. 02 0 0. 01 8 0. 01 8 0. 02 5 0. 02 0 0. 02 5 0. 01 8 0. 03 0 0. 02 3 C f a ll ba ck ,e q u a l 0. 51 3 0. 46 3 0. 47 8 0. 61 4 0. 50 7 0. 59 9 0. 47 8 0. 71 5 0. 58 8 C f a ll ba ck ,l o w er 0. 64 5 0. 58 8 0. 61 1 0. 73 2 0. 63 9 0. 71 8 0. 61 1 0. 83 2 0. 72 3 λ M ≤ 1 × 10 − 3 C P D 0. 56 0 0. 46 9 0. 48 4 0. 66 4 0. 53 4 0. 64 6 0. 48 4 0. 71 5 0. 59 4 C f a ll ba ck ,h ig h er 0. 02 0 0. 01 8 0. 01 8 0. 02 5 0. 02 0 0. 02 5 0. 01 8 0. 03 0 0. 02 3 C f a ll ba ck ,e q u a l 0. 51 3 0. 46 3 0. 47 8 0. 61 4 0. 50 7 0. 59 9 0. 47 8 0. 71 5 0. 58 8 C f a ll ba ck ,l o w er 0. 59 8 0. 67 0 0. 50 0 0. 72 3 0. 63 8 0. 69 1 0. 50 0 0. 91 1 0. 70 3 λ M ≤ 1 × 10 − 4 C P D 0. 54 9 0. 64 1 0. 52 9 0. 58 0 0. 76 1 0. 69 3 0. 52 9 0. 66 9 0. 51 6 C f a ll ba ck ,h ig h er 0. 02 3 0. 02 8 0. 02 3 0. 02 6 0. 03 3 0. 03 0 0. 02 3 0. 03 0 0. 02 3 C f a ll ba ck ,e q u a l 0. 50 6 0. 57 2 0. 52 7 0. 67 5 0. 72 1 0. 62 4 0. 52 7 0. 66 9 0. 56 0 C f a ll ba ck ,l o w er 0. 96 8 0. 95 9 0. 55 0 0. 96 7 0. 90 3 0. 96 0 0. 55 0 1. 0 0. 97 9 λ M ≤ 1 × 10 − 5 C P D 0. 68 4 4. 90 5 0. 64 5 5. 51 0 0. 61 6 5. 19 8 0. 64 5 0. 52 1 1. 44 5 C f a ll ba ck ,h ig h er 0. 06 6 0. 47 9 0. 06 3 0. 53 9 0. 06 0 0. 50 8 0. 06 3 0. 03 0 0. 50 8 C f a ll ba ck ,e q u a l 0. 65 5 1. 11 4 0. 64 4 1. 74 3 0. 59 8 1. 42 5 0. 64 4 0. 52 1 1. 43 00 C f a ll ba ck ,l o w er 0. 97 6 0. 98 3 0. 35 6 0. 98 2 0. 97 2 0. 97 5 0. 35 6 1. 0 0. 98 5 Ta bl e 5. 6: T he de te ct io n tim e (T D ) of ea ch FD re la tiv e to th e T D of th e BD BD so lu tio n fo r at ta in in g sp ec ifi c va lu es of th e av er ag e m ist ak e ra te (λ M ) in th e co nfi gu ra tio ns : C P D ,C f a ll ba ck ,h ig h er ,C f a ll ba ck ,e q u a l, an d C f a ll ba ck ,l o w er . A B ,N N ,R F, LR an d D T ar e th e ab br ev ia tio ns of A da bo os t, N eu ra lN et wo rk ,R an do m Fo re st ,L og ist ic s R eg re ss io n an d D ec isi on Tr ee re sp ec tiv el y. 33 5. Results 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (a) Result for the zoomed-in time range 0 to 3 seconds. 0 5 10 15 20 25 30 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (b) Result over the time range 0 to 30 seconds. Figure 5.3: The trade-off between TD and λM for the results of CP D without fallback. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (a) Result for the zoomed-in time range 0 to 3 seconds. 0 5 10 15 20 25 30 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (b) Result over the time range 0 to 30 seconds. Figure 5.4: The trade-off between TD and λM for the results of Cfallback,higher and CP D. 5.3 Clustering ML-Based Binary Classifiers 5.3.1 Crash Intervals The ML-Classifiers were grouped into clusters during the crash intervals in which at least one ML-Classifier reported FN using K-means algorithm [28]. The entire test dataset consisted of 10,573 crash intervals, and among these, a total of 10,537 crash intervals were identified wherein at least one ML-Classifier reported FN. For the crash intervals where no ML-Classifier reported FN errors, clustering the ML-Classifiers was unnecessary. From the Table 5.1 and the clustering results (table A.2) from the K-means algorithm, we notice that: 34 5. Results 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (a) Result for the zoomed-in time range 0 to 3 seconds. 0 5 10 15 20 25 30 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (b) Result over the time range 0 to 30 seconds. Figure 5.5: The trade-off between TD and λM for the results of Cfallback,equal and CP D. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (a) Result for the zoomed-in time range 0 to 3 seconds. 0 5 10 15 20 25 30 Detection Time (s) 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Av g. M ist ak e Ra te AdaBoost PR NN Random Forest PR SVM PR Decision Tree PR Logistic Regression PR BDBD TTF Stacking PR BlendStacking Ensemble (b) Result over the time range 0 to 30 seconds. Figure 5.6: The trade-off between TD and λM for the results of Cfallback,lower and CP D. • The Naive Bayes classifier contributes to the majority of FNs, whereas the remaining classifiers exhibit a relatively small number of FNs. The occurrence of crash intervals wherein more than three ML-Classifiers report FN is infrequent. Based on this observation, it is appropriate to set the number of clusters utilized by the K-means algorithm to two for the majority of crash intervals. • Throughout the majority of crash intervals, the occurrence of FNs is typically reported by each classifier only once. In instances where a classifier does report FNs multiple times within a single crash interval, it is observed that these instances of FN reports from the classifier are predominantly contiguous. This finding implies that situations akin to the "Cried Wolf" phenomenon, where a classifier frequently oscillates between FN and True Positive (TP) classifications, 35 5. Results are very uncommon. An attempt was also made to assign weights to various samples using a decreasing arithmetic progression. However, the introduction of weights yielded negligible impact for the majority of crash periods. 5.3.2 Crash-free Intervals The ML-Classifiers were also grouped into clusters during the crash-free intervals in which at least one ML-Classifier reported FP using K-means algorithm [28]. Again, clustering of ML-Classifiers was unnecessary during crash-free intervals where no ML-Classifier reported FP. The entire test dataset consisted of 3,569 crash-free intervals, of which instances were observed where only one classifier reported FP at some data points, while at other data points, two or more classifiers reported FPs. The Logistics Regression classifier and the Naive Bayes classifier generated the most FPs, whereas the remaining classifiers showed relatively consistent numbers of FPs generated. From the Table 5.1 and the clustering results (table A.2) from the K-means algorithm, we notice that: • Instances wherein more than two classifiers simultaneously report FPs at the same data point are rare. Therefore, we explored the experiments with two and three clusters when using the K-means algorithm. Setting the number of clusters to a higher value did not yield substantial or meaningful outcomes. • In most cases, the Logistics Regression classifier and the Naive Bayes classifier were consistently clustered together as one group, while the remaining classifiers were grouped separately as another distinct cluster. 5.3.3 Ranking of ML-Based Binary Classifiers To determine the presence of structural patterns within these classifiers, a ranking in terms of specificity and sensitivity was conducted. The result of this ranking is presented in Table 5.4. The top-performing classifiers based on this ranking include Random Forest, AdaBoost, SVM, and Decision Tree. 5.3.3.1 Pairwise Comparison The comparison results of Random Forest, SVM and Decision Tree are presented in Table 5.7. This table shows that the closest two classifiers are Random Forest and SVM. Furthermore, the comparison of Random Forest, SVM and AdaBoost is presented in Table 5.8, which shows that AdaBoost is closer to SVM. 5.4 Li-Marin LSTM FD In this section, we present the result of Li-Marin LSTM FD for lower and higher over-estimation. The confusion matrix data for LSTM FD is presented in Table 36 5. Results SVM vs Decision Tree Random Forest vs Decision Tree Random Forest vs SVM FP (SVM & Decision Tree): 3727753 (Random Forest & De- cision Tree): 3727851 (Random Forest & SVM): 3727890 (¬SVM & Decision Tree): 104 (¬Random Forest & Decision Tree): 6 (¬Random Forest & SVM): 34 (SVM & ¬Decision Tree): 171 (Random Forest & ¬Decision Tree): 315 (Random Forest & ¬SVM): 276 (¬SVM & ¬Decision Tree): 694 (¬Random Forest & ¬Decision Tree): 550 (¬Random Forest & ¬SVM): 522 FN (SVM & Decision Tree): 12118132 (Random Forest & De- cision Tree): 12118135 (Random Forest & SVM): 12118143 (¬SVM & Decision Tree): 8 (¬Random Forest & Decision Tree|: 5 (¬Random Forest & SVM): 66 (SVM & ¬Decision Tree): 77 (Random Forest & ¬Decision Tree): 20 (Random Forest & ¬SVM): 12 (¬SVM & ¬Decision Tree): 152 (¬Random Forest & ¬Decision Tree): 209 (¬Random Forest & ¬SVM): 148 Table 5.7: The pairwise comparison of classifiers Random Forest, SVM and Decision Tree. SVM vs Adaboost Random Forest vs Adaboost FP (SVM & Adaboost): 3727876 (Random Forest & Adaboost): 3728055 (¬SVM & Adaboost): 48 (¬Random Forest & Adaboost): 111 (SVM & ¬Adaboost): 276 (Random Forest & ¬Adaboost): 97 (¬SVM & ¬Adaboost): 522 (¬Random Forest & ¬Adaboost): 459 FN (SVM & Adaboost): 12118142 (Random Forest & Adaboost): 12118142 (¬SVM & Adaboost): 67 (¬Random Forest & Adaboost): 13 (SVM & ¬Adaboost): 117 (Random Forest & ¬Adaboost): 117 (¬SVM & ¬Adaboost): 43 (¬Random Forest & ¬Adaboost): 97 Table 5.8: The comparisons of classifiers SVM vs Adaboost and Random Forest vs Adaboost . 5.9, along with the result of LSTM binary classifier FD implemented by Häger and Köre in [9]. Table 5.10 shows the classification metrics that are derived from Table 5.9. Note that the evaluation of the LSTM FD involves assessing its probability of availability (PA) and the classification metrics including Accuracy, Specificity, Precision, Sensitivity and Fall-Out. As expected, the Li-Marin LSTM model with higher over-estimation tends to generate more FN errors, incorrectly identifying faulty nodes as non-faulty. Conversely, when exhibiting lower over-estimation, it tends to generate more FP errors, wrongly 37 5. Results LSTM FD FP TP FDR FN TN FOR LSTM Binary Classifier FD 33829 3684769 0.9097% 1872 12116930 0.0154% Li-Marin LSTM with lower over-estimation 24122 387254201 0.0062% 7551 107519792 0.0070% Li-Marin LSTM with higher over-estimation 6818 386813574 0.0018% 12473 107530313 0.0160% Table 5.9: The confusion matrix data results of LSTM classifier and LSTM FD. LSTM FD PA Accuracy Sensitivity Specificity Precision Fall-Out AUC ROC LSTM Binary Classifier FD 0.997746 0.999492 0.997216 0.990903 0.002784 0.998354 Li-Marin LSTM with lower Over- estimation 0.998643 0.999537 0.999972 0.999291 0.997179 0.000087 Li-Marin LSTM with higher Over- estimation 0.998547 0.999101 0.999867 0.999941 0.999788 0.000059 Table 5.10: The PA and classification metrics results of LSTM classifier FD and the Li-Marin LSTM FD at lower and higher degree of over-estimation. identifying non-faulty nodes as faulty. Table 5.9 shows that the Li-Marin LSTM FD outperforms the LSTM classifier FD in terms of FP errors, regardless of whether it has a lower or higher over-estimation. However, concerning FN errors, their performances are comparable. Besides, as shown in Table 5.10, the Li-Marin LSTM FD performs better than the LSTM classifier FD across all metrics. 5.5 Discussion of Results 5.5.1 Interpretation We evaluated the performance of seven individual ML-based FDs, As detailed in Section 5.1, the evaluation on our dataset reveals different levels of FP and FN across these FDs. Among the individual ML-based FDs, three FDs tend to produce more FP errors, and four FDs lean towards generating more FN errors. In Section 5.3, we show that they have a certain structural pattern. The Random Forest, SVM, and AdaBoost FDs are close to each other in terms of FP and FN levels. Except for Naive Bayes FD, the accuracy of all individual ML-based FDs is three nines, which indicates that ML-based FDs are effective for the failure nodes detection problem. Moreover, all ML-based FDs more or less produced FP and FN errors, which may indicate that some data points are difficult to classify. The results of the configuration Cfallback in Section 5.2 show that using BDBD as a fallback can enhance the performance of ML-based FDs in certain scenarios while diminishing it in others. Most ML-based FDs remained unaffected by the fallback in Cfallback,higher. This could be attributed to the FDs having already determined their final suspicions before the fallback was triggered. The ML-based FDs in Cfallback,equal demonstrated improvements in detection time compared to CP D with the help of fallback. Conversely, in Cfallback,lower, the fallback caused longer detection time for the ML-based FDs. This might be because the fallback threshold is lower than the threshold of the PD module, leading to the suspicions made by ML-based FDs being 38 5. Results overridden by the fallback. Section 5.2 reveals that ML-based FDs exhibit a shorter detection time for failure nodes compared to analytical FDs, namely BDBD and TTF. However, analytical FDs can achieve a higher accuracy when the detection time is longer. The ensemble FDs combine the advantages of both ML-based FDs and analytical FDs to a certain extent. Specifically, the Stacking and Ensemble FDs can achieve a higher accuracy than individual ML-based FDs. Nevertheless, since they are built on top of part or all of ML-based FDs, they are still subject to the characteristics of ML-based FDs. Consequently, their accuracies still fall short of analytical FDs when the detection time becomes longer. The BlendStacking FD overcomes the shortcomings of ML-based FDs and the analytical FDs. That is to say, it can identify failure nodes more rapidly than analytical FDs, and when the detection time becomes longer, it can achieve the same accuracy as analytical FDs. 5.5.2 Limitations Since our experiments were conducted on the “simulative” environment as described in Section 4.1, the results only reflect the performance of FDs in this environment. It does not represent the performance of FDs in a real environment, where some non-faulty nodes may be slow to reply. Furthermore, to deploy these FDs in a real-world setting, they need to be retrained using the data in the real environment. Given that we use the dataset from Häger and Köre, the limitations associated with their dataset also apply to our study. The test dataset has only 15.8×106 data points, and the number of data points corresponding to results with λM ≤ 10−5 is even smaller. This is because such data points are generated based on specific conditions. One should be cautious when interpreting our results. Furthermore, this concern is exacerbated by the specificity of the results, which stem from data collected from only eight highly stable nodes within a relatively short period spanning a few weeks. The p