Detecting failures in distributed systems with the FALCON spy network

Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish

Summary

This paper mainly focuses on designing and implementing a reliable failure detector (RFD), which is a hard problem and thus receives less attention. Previous approaches to failure detection use the timeout mechanism, which may lead to tens of seconds of unavailability. The authors also discuss end-to-end timeouts and assert that there is no good end-to-end timeout mechanism for now.

In this paper, the authors present Falcon, a failure detector that leverages internal knowledge from various system layers to achieve: (1) sub-second crash detection time, (2) reliability, and (3) little disruption. Their goals are to improve applications' availability and reduce applications' complexity. The authors start from an observation that many crashes can be observed readily by looking at the right (or closest) layer of the system. Therefore, Falcon utilizes a network of spy modules or spies to monitor failure events in different layers, which only requires a small amount of platform-specific logic. Besides, spies are arranged in a chained network so that the spy in one layer can monitor the spy at the next layer up. Furthermore, in this paper, the authors also implement and evaluate a version of Falcon, which deploys spies on four layers: (1) application, (2) OS, (3) virtual machine monitor, and (4) network switch. Their results demonstrate the effectiveness of Falcon.

Contributions

This is the first viable and fast RFD, which gives up the design idea of timeout mechanism. Previous RFDs all have limitations of large timeouts and disruption from small timeouts.
Falcon relies on a novel spy network rather than timeout mechanism to achieve the ideal design goals: reliability, short detection time, and little disruption. The new aspects of this spy network are (1) layer-specific monitors, (2) a network of chained spies, and (3) composing these two with existing techniques.
Apart from the high-level design ideas, the authors also provide a concrete, complete, and sound design for Falcon. Moreover, they present a concrete implementation of Falcon.
The authors conduct extensive evaluations to examine and prove the effectiveness of Falcon.

Strengths

Falcon has high availability. Add Falcon to ZooKeeper and to a replication library reduces unavailability after some crashes by roughly six times.
Falcon can simplify applications' implementations that use failure detectors. For example, with Falcon, a replicated state machine can replace Paxos with the primary-backup mechanism.
Falcon can achieve sub-second detection time, which is much faster than baseline approaches.

Weaknesses

Falcon cannot guarantee that spies will not ignore any crash failures. Although the authors introduce a large end-to-end timeout as a backstop, this timeout mechanism may lead to a long time of unavailability.
To reliably report process status (i.e., UP or DOWN), Falcon should be able to communicate with the remote system. However, it cannot avoid the situation of the network partition.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2018.03.06-Falcon.md

2018.03.06-Falcon.md

Detecting failures in distributed systems with the FALCON spy network

Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish

Summary

Contributions

Strengths

Weaknesses

Files

2018.03.06-Falcon.md

Latest commit

History

2018.03.06-Falcon.md

File metadata and controls

Detecting failures in distributed systems with the FALCON spy network

Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish

Summary

Contributions

Strengths

Weaknesses