When Crash Fault Tolerance Meets Machine Learning

dc.contributor.author: HÄGER, GUSTAV
dc.contributor.author: KÖRE, JONATHAN
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik
dc.contributor.examiner: Haghir Chehreghani, Morteza
dc.contributor.supervisor: Schiller, Elad Michael
dc.date.accessioned: 2022-10-14T13:00:01Z
dc.date.available: 2022-10-14T13:00:01Z
dc.date.issued: 2022
dc.date.submitted: 2020
dc.description.abstract: Fault tolerance is vital for distributed systems, as it allows them to operate in networks where nodes may fail. Several properties are of interest when considering fault tolerance, but we focus on safety, i.e., system consistency, and liveness, i.e., algorithmic progress. Many distributed applications can be realized through various algorithms, some of which provide safety but not necessarily liveness. In an algorithm that provides only safety, nodes may stop responding, e.g., if they experience crash failures. If this happens in an asynchronous system, the well-known FLP impossibility result implies that it is impossible to know whether the node has crashed or is simply abnormally slow. Mechanisms are therefore needed to circumvent the FLP impossibility in such systems. We study one such mechanism, the unreliable Failure Detector, which is an augmentation of the asynchronous model. We consider systems in which mistakes, i.e., faulty suspicions, are costly. In particular, we model failure detection as a binary classification problem through our simple and generic parameter model, which uses both timed parameters, i.e., those calculated using clocks (or timers), and time-free parameters, i.e., those calculated by counting round-trip completions in the system. With this, we answer our research question, "Can Machine Learning-based Failure Detectors balance the trade-off between the detection time, the probability of a faulty suspicion, and the cost of a faulty suspicion better than existing solutions?", in the affirmative through a broad range of Machine Learning-based Failure Detectors built on many different classifiers. We also present a method to lower the probability of a false suspicion by analyzing the precision of the classifiers.
Our results show that the learning task is well suited to a Federated Learning setting, for which we use FedDyn; this is promising because such a setting inherently implies scalability. We find that our best Failure Detector, which uses Random Forest, lowers the detection time by up to 86.6% compared with an existing solution without making more mistakes.
dc.identifier.coursecode: DATX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/305716
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Unreliable Failure Detectors
dc.subject: Machine Learning
dc.subject: Distributed Systems
dc.subject: Federated Learning
dc.subject: Fault Tolerance
dc.subject: Asynchronous Systems
dc.subject: Consensus
dc.title: When Crash Fault Tolerance Meets Machine Learning
dc.type.degree: Master's thesis (Examensarbete för masterexamen)
dc.type.uppsok: H

Download

Name: CSE 22-127 Häger Köre.pdf
Size: 2.39 MB
Format: Adobe Portable Document Format
