When Crash Fault Tolerance Meets Machine Learning
Type
Master's thesis
Abstract
Fault tolerance is vital for distributed systems as it allows them to operate in networks
where nodes may experience failures. Several properties are of interest when
considering fault tolerance, but we focus on safety, i.e., system consistency, and liveness,
i.e., algorithmic progress. Many different distributed applications can be
realized through various algorithms, some of which provide safety, but not necessarily
liveness. In an algorithm that only provides safety, it is possible that nodes
may stop responding, e.g., if they experience crash failures. If this happens in an
asynchronous system, it is impossible to tell whether the node has crashed or is simply
abnormally slow, the difficulty underlying the well-known FLP impossibility result. Because of this,
mechanisms are needed to circumvent the FLP impossibility in such systems. We
study one such mechanism, the unreliable Failure Detector, which is an augmentation
of the asynchronous model. For this, we consider systems in which it is costly to
make mistakes, i.e., faulty suspicions. In particular, we model failure detection as a
binary classification problem through our simple and generic parameter model, which
uses both timed parameters, i.e., those calculated using clocks (or timers), and
time-free parameters, i.e., those calculated by counting round-trip completions in the system.
With this model, we answer our research question, "Can Machine Learning-based Failure Detectors
balance the trade-off between the detection time, the probability of a faulty
suspicion and the cost of a faulty suspicion better than existing solutions?", in the
affirmative through a broad range of Machine Learning-based Failure Detectors, for
which many classifiers serve as the basis. We also present a method to lower the
probability of a false suspicion by analyzing the precision of the classifiers. Our results
show that the learning task is suitable for a Federated Learning setting, where
we use FedDyn, which is promising as it inherently implies scalability. We find that
our best Failure Detector, which uses Random Forest, is able to lower the detection
time by up to 86.6% in comparison to an existing solution whilst not making more
mistakes.
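
To make the framing in the abstract concrete, the following is a minimal sketch of failure detection as binary classification over one timed and one time-free parameter, with a suspicion threshold chosen from the classifier's precision. It is not the thesis's implementation: the feature names (`elapsed_ms`, `missed_rounds`), the linear `suspicion_score` standing in for a trained classifier such as Random Forest, its weights, and the training samples are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Observation:
    """One sample about a monitored node (hypothetical feature names)."""
    elapsed_ms: float    # timed parameter: clock time since the node's last reply
    missed_rounds: int   # time-free parameter: round trips completed by others meanwhile
    crashed: bool        # ground-truth label, available only in training data


def suspicion_score(obs: Observation, w_time: float = 0.002, w_rounds: float = 0.3) -> float:
    """Toy stand-in for a trained classifier: a larger score means 'more likely crashed'."""
    return w_time * obs.elapsed_ms + w_rounds * obs.missed_rounds


def precision_at(threshold: float, scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Fraction of suspicions (score >= threshold) that are true crashes."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    return tp / (tp + fp) if (tp + fp) else 1.0


def pick_threshold(scores: Sequence[float], labels: Sequence[bool], target: float) -> float:
    """Smallest observed score whose precision meets the target.

    A smaller threshold suspects sooner (shorter detection time); the precision
    target bounds how often a suspicion is a mistake.
    """
    for t in sorted(set(scores)):
        if precision_at(t, scores, labels) >= target:
            return t
    return float("inf")  # target unreachable: never suspect


def suspects(obs: Observation, threshold: float) -> bool:
    """Binary decision of the detector: suspect the node or not."""
    return suspicion_score(obs) >= threshold


# Hypothetical labelled samples: three live nodes (one of them slow), three crashed.
training = [
    Observation(100, 0, False),
    Observation(400, 1, False),
    Observation(900, 2, False),
    Observation(800, 4, True),
    Observation(1500, 6, True),
    Observation(2000, 9, True),
]
scores = [suspicion_score(o) for o in training]
labels = [o.crashed for o in training]
threshold = pick_threshold(scores, labels, target=0.9)
```

With these samples, a 0.9 precision target places the threshold above the slow-but-alive node, so it is not suspected, while a looser target would suspect it sooner at the price of a mistake. When mistakes are costly, the target would presumably be set high.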
Subject/keywords
Unreliable Failure Detectors, Machine Learning, Distributed Systems, Federated Learning, Fault Tolerance, Asynchronous Systems, Consensus