An Exponential Smoothing Adaptive Failure Detector in the Dual Model of Heartbeat and Interaction

  • cc icon
  • ABSTRACT

    In this paper, we propose a new implementation of a failure detector. The implementation uses a dual model of heartbeat and interaction. First, the heartbeat model is adopted to shorten the detection time, if the detection process does not receive the heartbeat message in the expected time. The interaction model is then used to check the process further. The expected time is calculated using the exponential smoothing method. Exponential smoothing can be used to estimate the next arrival time not only in the random data, but also in the data of linear trends. It is proven that the new detector in the paper can eventually be a perfect detector.


  • KEYWORD

    Failure detector , Exponential smoothing , Eventually perfect

  • I. INTRODUCTION

    A failure detector is a key building block for fault-tolerant distributed system, which provide a mechanism to collect information of process failure. Chandra and Toueg [1] introduced the concept of an unreliable failure detector and many fault-tolerance algorithms have been proposed based on unreliable failure detectors.

      >  The fixed timeout Δto is set as a constant value in a conventional failure detector and the monitoring process begins to suspect the monitored process if it does not receive a heartbeat message after time Δto. The disadvantage is that it is difficult to determine an appropriate Δto. If Δto is short, it is easy to make a wrong inference; if Δto is long, it results in a long time expense of detection time.

    Recently, some adaptive failure detectors have been presented [2-6]. Most of them are based on a heartbeat strategy and modify the timeout value dynamically according to the network conditions. Chen et al. [7] estimate (hereinafter, Chen’s estimation) the next heartbeat arrival time by computing the average transmission time and a fixed safety margin is added. Bertier et al. [8] combine Chen’s estimation and a dynamical safety margin based on Jacobson’s [9] estimation of the round-trip time.

    In addition, some studies [10-12] introduce Omega failure detectors (type of failure detectors) which judge process failure not according to the check point, but according to an output value φ (0 ≤ φ ≤1) in order to present the reliability of the process.

    The next part of this section will introduce the failure detector properties and the quality of service of the failure detector.

      >  A. Failure Detector Properties

    Failure detectors are characterized by two properties: completeness and accuracy. Two kinds of completeness and four kinds of accuracy are defined [1, 13].

    Completeness characterizes the failure detector's capability to suspect every incorrect process permanently. Two kinds of completeness are defined.

    1) Strong completeness: Eventually, every process that crashes is permanently suspected by every correct process. 2) Weak completeness: Eventually, every process that crashes is permanently suspected by some correct process.

    Accuracy is the characterization of the failure detector’s capability not to suspect correct processes. Four kinds of accuracy are defined.

    1) Strong accuracy: No process is suspected before it crashes. 2) Eventual strong accuracy: There is a time after which correct processes are not suspected by any correct process. 3) Weak accuracy: Some correct process is never suspected. 4) Eventual weak accuracy: There is a time after which some correct process is never suspected by any correct process.

    Eight classes of failure detectors are yielded by combining the two kinds of completeness and four kinds of accuracy, named as perfect, P; eventually perfect, ◇P; strong, S; eventually strong, ◇S; Q; ◇Q; weak, W; eventually weak, ◇W.

    Generally speaking, a good failure detector should be a perfect or eventually perfect detector. In this paper we design an eventually perfect detector. The detector meets the requirement of strong completeness and eventual strong accuracy.

      >  B. Quality of Service of Failure Detectors

    Some metrics have been proposed to specify the quality of service of a failure detector [14]. The main metrics are as follows:

    ● Detection time: The time from when the process crashes to the time when it is permanently suspected. ● Mistake recurrence time: The time between two consecutive mistakes. ● Mistake duration: The duration time for the failure detector to correct a mistake.

    The rest of this paper is organized as follows: Section II describes our failure detection model; Section III presents the new algorithm in our failure detector; Section IV proves that it is an eventually perfect failure detector; Section V presents the performance evaluation by a series of experiments; and Section VI consists of the conclusion of our research.

    Ⅱ. THE MODEL OF FAILURE DETECTION

      >  A. The Heartbeat Model

    The heartbeat model [15] is used in most distributed systems. Every process p periodically sends an “I am alive” heartbeat message to the process q. The period is the heartbeat interval Δi.

    If q does not receive a heartbeat message from p after a timeout delay Δto, p is added to the list of suspected processes. If q receives the heartbeat message from p later, then q removes p from its list of suspected processes, as shown in Fig. 1.

    The heartbeat interval Δii is the time between two emissions of the “I am alive” heartbeat message. The timeout delay Δtoto is the time between the last reception of the “I am alive” message from p and the time where q starts suspecting p. The transmission delay Δtrtr is the time between the emission of the heartbeat message and the reception of the heartbeat message.

      >  B. The Interaction Model

    The process q monitors every process p by sending an “Are you alive?” message to p periodically. Once p receives the message from q, it replies with an “I am alive” message to q. If q does not receive the message from p after a timeout delay Δto, p is added to the list of suspected processes. If q receives the heartbeat message from p later, then q removes p from its list of suspected processes, as shown in Fig. 2.

      >  C. The Dual Model of Heartbeat and Interaction

      >  The heartbeat failure detector sends half as many messages as the interaction detector for the same detection quality, so the heartbeat failure detector is used in most of the distributed systems. However, it often mistakes actual failure statuses due to loss of packets or long delays in a complicated network. The interaction failure detector needs more detection time and sends more messages than the heartbeat detector, but it can be implemented as a requirement, and the detection result is independent of the message.

    In this paper, we will use a dual model of heartbeat and interaction. The dual model is organized in two steps. First, the heartbeat model is used to detect the process failure. If a process p is suspected, then the interaction model is used to check the process p further. The detailed procedure is as shown in Fig. 3.

    Ⅲ. AN EXPONENTIAL SMOOTHING ADAPTIVE FAILURE DETECTOR

    Chen’s estimation is a classical adaptive failure detector algorithm. It estimates the arrival time of the next heartbeat message according to the historical time sequences of the heartbeat messages. In addition, a constant safety margin is added to raise the detector’s accuracy.

    The process q receives n heartbeat messages denoted m1, m2, m3, ..., mn. The receipt time of the messages are presented as A1, A2, A3, ..., An and Δi is the heartbeat interval. Then, the estimation arrival time of the next message can be expressed as follows:

    The timeout checking point is

    where is ∂ the constant safety margin.

    Chen’s failure detection algorithm can estimate the arrival time of the next heartbeat message dynamically. However, it uses a constant safety margin, and the suitable constant ∂ is difficult to determine in a condition with a complex network. In addition, the algorithm uses the average transmission delay of the most recent k messages to estimate the next transmission time. And it is only suitable for the fluctuation sequence with small fluctuations. If the delay sequences increase or decrease in a linear manner, Chen’s estimation does not correspond to the real delay.

    An improved estimation algorithm based on exponential smoothing is proposed to overcome the shortcomings of the base algorithm. Exponential smoothing [16] is a technique that can be applied to a time data sequence. It makes forecasts according to a series of historical data.

    The delay data sequence is represented by {xi}, xi = AiΔt*i and the estimation of the exponential smoothing algorithm is written as . The simplest form of exponential smoothing is given by the formulas

    where α is the smoothing factor, and 0 < α < 1.

    The above simple exponential smoothing does not perform well when there is a trend in the data. In such situations, double exponential smoothing is used to estimate the following data in a sequence with a linear trend. Double exponential smoothing works as follows:

    where ai is the estimated level at time i and bi is the estimated trend at time i.

    Then,

    Another improvement in the algorithm is that the dynamic safety margin is calculated using the mean square deviation. As shown in Fig. 1, if the two adjacent heartbeat messages have the same transmission delay, then, Ai+1 = Ait , so Ai+1 – (Ait) can represent the fluctuation in the adjacent transmission delay, and it may be positive or negative.

    Finally, we have

    Algorithm 1 is the exponential smoothing adaptive failure detector (ESA_FD) algorithm.

    Ⅳ. PROOF

    Our failure detection algorithm can meet the requirements requirements of class ◇ P. In other words, it has the properties of strong completeness and eventual strong accuracy.

    Theorem 1. Strong completeness.

    Every crashed process p is permanently suspected by every correct process

    ∃t0 :∀t ≥ t0, ∀q∈correct(t),∀q∈crashed p∈suspectq(t).

    Before providing proof, we defined several parameters:

    Δmax: The unknown bound time threshold on the message transmission between processes p and q.mk: The kth message that process p sends to q.tsk : The time when process p sends mk to q.trk : The time when process q receives mk.tc : The time of a crash of process p.

    When the process p crashes at tc, p stops sending “I am alive” messages. Then tsktc, so there is an upper bound tc + Δmax after the time when process q can’t receive any more messages from p.

    The next thing to be proven is that τi+1 exists as an upper bound.

    where xmax is the maximum value in {xi}.

    So there is an upper bound (i + 1) * Δi+ * xmax + 3Δmax in the worst condition after which process q starts suspecting process p if it does not receive any messages from p.

    Theorem 2. Eventual strong accuracy.

    Theorem 2 states that there is an upper bound time after which the correct processes are not suspected by any correct process.

    tbound, ∀ttbound, ∀q,

    pcorrect(t), psuspectq(t)

    However, since the maximum transmission delay is Δmax, the dual model of heartbeat and interaction is used in this paper. In the worst network conditions, the checkpoint isτi+2 Δmax, so a correct process will not be suspected by any correct process.

    Ⅴ. EXPERIMENT

    Two computers are used to build the experimental platform in a local area network. One machine runs a program to send heartbeat messages (like process p), and the other machine runs the monitored program to receive heartbeat messages (like process q). We simulate complex network conditions by uploading files at a random time, and neither machines fail during the experiments.

    In the paper, we compare ESA_FD to conventional FD and Chen’s FD. The conventional FD sets a fixed timeout Δto. If the monitor process does not receive the heartbeat message after Δto, it begins to suspect the monitored process. A good Δto is very important to the FD. If Δto is short, it’s easy to make a wrong suspicion. Otherwise, it results in a long time expense in detection time, but this is hard to determine under complex network conditions.

    Chen’s FD estimates the arrival time of the next heartbeat message dynamically and it uses the average transmission delay of the most recent k messages to compute the next transmission time. To improve the accuracy, an additional safety margin ∂ is added. However, ∂ is a constant value in Chen’s FD.

    Experiment 1. Transmission delay contrast.

    We count thirty transmission times of the heartbeat messages and the heartbeat interval is set to 2 seconds. We compute the transmission delay according to the estimation of the arrival time by using Chen’s FD and ESA_FD. The result is shown in Fig. 5.

    Since the average delay is used to estimate the arrival delay in Chen’s FD, Chen’s FD is only suitable for a series of random data and there is a big difference between the real transmission delay and Chen’s estimation when the data has a linear trend. Whereas the estimation in ESA_FD confirms the real transmission delay very well, exponential smoothing uses a weight value to emphasize the recent data and ESA_FD is suitable for the data not only for random but also for linear trends.

    Experiment 2. The average mistake rate contrast.

    Two groups of tests are involved in computing the average mistake rate. The heartbeat interval of the first group is set to 2 seconds and the other is set to 5 seconds. The fixed timeout in the conventional FD is set to 1.8 seconds and the safety margin in Chen’s FD is set to 0.5 seconds. The result is shown in Fig. 6.

    Since the conventional FD uses a fixed timeout to detect a failure, and it is difficult to choose the fixed value to fit complex network conditions, so the fixed timeout of 2 seconds may not be the most suitable value since it results in a high mistake rate of about 10%. Chen’s FD is better than the conventional FD, but it is not suitable for data with a linear trend. Therefore, the safety margin is an important adjustment for the result, and in the same manner as the conventional FD, it is hard to determine the value of the safety margin. The ESA_FD in this paper has a very low mistake rate providing accurate estimation to the transmission delay and dual detection starts while a process timeouts first. This further raises the accuracy of detection.

    Experiment 3. The average detection time contrast.

    The parameters are set as in Experiment 2. In the experiment, we count the detection time and the result is shown in Fig. 7.

    EFA_FD uses dual detection while a process is suspected, so it takes more time than Chen’s FD. However, since the estimation to the transmission is more precise, the amount of time for dual detection is shorter, and the difference between the Chen’s FD and ESA_FD is very small. So EFA_FD raises the detection accuracy by increasing detection time by a small margin.

    Ⅵ. CONCLUSION

    ESA_FD is an improved adaptive failure detector, which is more suitable for complex network conditions than Chen’s FD. On the one hand, ESA_FD uses exponential smoothing to predict the next data and the prediction through exponential smoothing is more accurate than Chen’s prediction, which is calculated with the average value, because Chen’s FD is only suited for random data with little fluctuation. Hence, ESA_FD can more accurately predict the series of data not only for random, but also in linear trends in applications where the weight value is of exponential smoothing. On the other hand, ESA_FD calculates the safety margin with the mean square deviation and it varies dynamically according the historical transmission delay. ESA_FD meets the requirements of strong completeness and eventually strong accuracy and the experimental results show that ESA_FD can increase the accuracy substantially by increasing the detection time by a small amount.

  • 1. Chandra T. D., Toueg S. 1996 Unreliable failure detectors for reliable distributed systems [Journal of the ACM] Vol.43 P.225-267 google doi
  • 2. Bansal S., Sharma S., Trivedi I. 2012 Adaptive staircase multiple failure detector for parallel and distributed image processing [in Proceedings of the 1st International Conference on Recent Advances in Information Technology] P.91-94 google
  • 3. Tian D., Mao T., Xie J. 2009 “Adaptive failure detection algorithm for grid systems,” in Fuzzy Information and Engineering google
  • 4. Ma T., Hillston J., Anderson S. 2010 On the quality of service of crash-recovery failure detectors [IEEE Transactions on Dependable and Secure Computing] Vol.7 P.271-283 google doi
  • 5. Wu C., Wu K., Feng L., Tian D. 2010 NN-SA based dynamic failure detector for services composition in distributed environment [in Advanced Data Mining and Applications] P.443-450 google
  • 6. Luan L., Liu S. F., Zhang X. J. 2008 Research and improvement of failure detector algorithm based on fresh point [Journal of Jilin University (Science Edition)] Vol.46 P.681-686 google
  • 7. Chen W., Toueg S., Aguilera M. K. 2002 On the quality of service of failure detectors [IEEE Transactions on Computers] Vol.51 P.561-580 google doi
  • 8. Bertier M., Marin O., Sens P. 2002 Implementation and performance evaluation of an adaptable failure detector [in Proceedings of the International Conference on Dependable Systems and Network] P.354-363 google
  • 9. Jacobson V. 1988 Congestion avoidance and control [ACM SIGCOMM Computer Communication Review] Vol.18 P.314-329 google doi
  • 10. Martin C., Larrea M., Jimenez E. 2007 On the implementation of the Omega failure detector in the crash-recovery failure model [in Proceedings of the 2nd International Conference on Availability, Reliability and Security] P.975-982 google
  • 11. Shi L., Hou Y. S. 2010 Adaptive failure detection model based on message delay prediction [Journal of Computer Applications] Vol.30 P.1312-1315 google
  • 12. Hayashibara N., Defago X., Yared R., Katayama T. 2004 The ? accrual failure detector [in Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems] P.66-78 google
  • 13. Liu B., Yang S., Shi L., Ding X., Zhang Q. 2011 Modeling of failure detector based on message delay prediction mechanism [Journal of Software] Vol.6 P.1821-1828 google
  • 14. Bertier M., Marin O., Sens P. 2003 Performance analysis of a hierarchical failure detector [in Proceedings of the International Conference on Dependable Systems and Networks] P.635-644 google
  • 15. Hayashibara N., Takizawa M. 2007 Performance analysis of the interrogation-based failure detector [in Proceedings of the 27th International Conference on Distributed Computing Systems Workshops] P.34 google
  • 16. Exponential smoothing google
  • [Fig. 1.] The heartbeat model.
    The heartbeat model.
  • [Fig. 2.] The interaction model.
    The interaction model.
  • [Fig. 3.] The dual model of heartbeat and interaction.
    The dual model of heartbeat and interaction.
  • [Fig. 5.] The transmission delay contrast. ESA_FD: exponential smoothing adaptive failure detector.
    The transmission delay contrast. ESA_FD: exponential smoothing adaptive failure detector.
  • [Fig. 6.] The average mistake rate contrast. ESA_FD: exponential smoothing adaptive failure detector.
    The average mistake rate contrast. ESA_FD: exponential smoothing adaptive failure detector.
  • [Fig. 7.] The average detection time contrast. ESA_FD: exponential smoothing adaptive failure detector.
    The average detection time contrast. ESA_FD: exponential smoothing adaptive failure detector.