When Crash Fault Tolerance Meets Machine Learning

HÄGER, GUSTAV; KÖRE, JONATHAN

dc.contributor.author	HÄGER, GUSTAV
dc.contributor.author	KÖRE, JONATHAN
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.examiner	Haghir Chehreghani, Morteza
dc.contributor.supervisor	Schiller, Elad Michael
dc.date.accessioned	2022-10-14T13:00:01Z
dc.date.available	2022-10-14T13:00:01Z
dc.date.issued	2022	sv
dc.date.submitted	2020
dc.description.abstract	Fault tolerance is vital for distributed systems, as it allows them to operate in networks where nodes may fail. Several properties are of interest when considering fault tolerance; we focus on safety, i.e., system consistency, and liveness, i.e., algorithmic progress. Many distributed applications can be realized through various algorithms, some of which provide safety but not necessarily liveness. In an algorithm that provides only safety, nodes may stop responding, e.g., if they experience crash failures. If this happens in an asynchronous system, it is impossible to know, by the well-known FLP impossibility result, whether the node has crashed or is simply abnormally slow. Mechanisms are therefore needed to circumvent the FLP impossibility in such systems. We study one such mechanism, the unreliable Failure Detector, which augments the asynchronous model. We consider systems in which mistakes, i.e., false suspicions, are costly. In particular, we model failure detection as a binary classification problem through a simple and generic parameter model that uses both timed parameters, i.e., those calculated using clocks (or timers), and time-free parameters, i.e., those calculated by counting round-trip completions in the system. With this model, we answer our research question in the affirmative: can Machine Learning-based Failure Detectors balance the trade-off between the detection time, the probability of a false suspicion, and the cost of a false suspicion better than existing solutions? We do so through a broad range of Machine Learning-based Failure Detectors built on many different classifiers. We also present a method to lower the probability of a false suspicion by analyzing the precision of the classifiers.
Our results show that the learning task is well suited to a Federated Learning setting, in which we use FedDyn; this is promising, as it inherently implies scalability. We find that our best Failure Detector, which uses Random Forest, lowers the detection time by up to 86.6% compared to an existing solution without making more mistakes.	sv
dc.identifier.coursecode	DATX05	sv
dc.identifier.uri	https://hdl.handle.net/20.500.12380/305716
dc.language.iso	eng	sv
dc.setspec.uppsok	Technology
dc.subject	Unreliable Failure Detectors	sv
dc.subject	Machine Learning	sv
dc.subject	Distributed Systems	sv
dc.subject	Federated Learning	sv
dc.subject	Fault Tolerance	sv
dc.subject	Asynchronous Systems	sv
dc.subject	Consensus	sv
dc.title	When Crash Fault Tolerance Meets Machine Learning	sv
dc.type.degree	Master's thesis	sv
dc.type.uppsok	H

Original bundle

Name: CSE 22-127 Häger Köre.pdf
Size: 2.39 MB
Format: Adobe Portable Document Format

Collections

Master's theses