When Crash Fault Tolerance Meets Machine Learning

Type

Master's thesis

Abstract

Fault tolerance is vital for distributed systems, as it allows them to operate in networks where nodes may fail. Several properties are of interest when considering fault tolerance; we focus on safety, i.e., system consistency, and liveness, i.e., algorithmic progress. Many distributed applications can be realized through various algorithms, some of which provide safety but not necessarily liveness. In an algorithm that only provides safety, nodes may stop responding, e.g., if they experience crash failures. If this happens in an asynchronous system, it is impossible to know whether the node has crashed or is simply abnormally slow, which is the obstacle behind the well-known FLP impossibility result. Mechanisms are therefore needed to circumvent the FLP impossibility in such systems. We study one such mechanism, the unreliable Failure Detector, which augments the asynchronous model. We consider systems in which mistakes, i.e., faulty suspicions, are costly. In particular, we model failure detection as a binary classification problem through a simple and generic parameter model that uses both timed parameters, i.e., those calculated using clocks (or timers), and time-free parameters, i.e., those calculated by counting round-trip completions in the system. With this, we answer our research question, "Can Machine Learning-based Failure Detectors balance the trade-off between the detection time, the probability of a faulty suspicion, and the cost of a faulty suspicion better than existing solutions?", in the affirmative, using a broad range of Machine Learning-based Failure Detectors built on many different classifiers. We also present a method to lower the probability of a faulty suspicion by analyzing the precision of the classifiers. Our results show that the learning task is well suited to a Federated Learning setting, for which we use FedDyn; this is promising because such a setting is inherently scalable.
We find that our best Failure Detector, which uses Random Forest, lowers the detection time by up to 86.6% compared with an existing solution while not making more mistakes.
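
To make the idea of precision-driven failure detection concrete, the following is a minimal sketch and not the thesis's implementation. It assumes one timed parameter and one time-free parameter per observation, and it replaces the trained classifier (e.g., Random Forest) with a toy scoring function. All names (`Observation`, `elapsed_ms`, `rounds_missed`, `suspicion_score`, `pick_threshold`) and the scaling constants are hypothetical. The sketch picks the lowest suspicion threshold whose precision on labeled validation data meets a target, which mirrors the trade-off between detection time and faulty suspicions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    # Hypothetical features; the abstract does not list the actual parameters.
    elapsed_ms: float    # timed: time since the monitored node last replied
    rounds_missed: int   # time-free: round trips others completed since then
    crashed: bool        # ground-truth label (True = the node really crashed)


def suspicion_score(obs: Observation,
                    ms_scale: float = 100.0,
                    rounds_scale: float = 5.0) -> float:
    """Toy stand-in for a classifier's crash probability, in [0, 1]."""
    timed = min(obs.elapsed_ms / ms_scale, 1.0)
    time_free = min(obs.rounds_missed / rounds_scale, 1.0)
    return 0.5 * timed + 0.5 * time_free


def precision_at(threshold: float, data: List[Observation]) -> float:
    """Fraction of suspicions at this threshold that are real crashes."""
    suspected = [o for o in data if suspicion_score(o) >= threshold]
    if not suspected:
        return 1.0  # no suspicions raised, hence no faulty suspicions
    return sum(1 for o in suspected if o.crashed) / len(suspected)


def pick_threshold(data: List[Observation], target_precision: float) -> float:
    """Lowest observed score whose validation precision meets the target.

    A lower threshold suspects sooner but risks more faulty suspicions.
    Returns infinity (never suspect) if no candidate meets the target.
    """
    for t in sorted({suspicion_score(o) for o in data}):
        if precision_at(t, data) >= target_precision:
            return t
    return float("inf")


# Made-up validation data, for illustration only.
validation = [
    Observation(20.0, 0, False),
    Observation(60.0, 1, False),
    Observation(80.0, 2, False),
    Observation(70.0, 3, True),
    Observation(100.0, 4, True),
    Observation(150.0, 6, True),
]

relaxed = pick_threshold(validation, 0.7)   # tolerates some faulty suspicions
strict = pick_threshold(validation, 0.95)   # costly mistakes: suspect later
```

On this made-up data, the stricter precision target yields a higher threshold, so the detector waits longer before suspecting a node. A real detector would obtain its scores from a trained classifier rather than from the hand-written weighting assumed here.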

Subject/keywords

Unreliable Failure Detectors, Machine Learning, Distributed Systems, Federated Learning, Fault Tolerance, Asynchronous Systems, Consensus
