Lifeguard: Local Health Awareness for More Accurate Failure Detection

Cited by: 4
Authors
Dadgar, Armon [1]
Phillips, James [1]
Currey, Jon [1]
Affiliations
[1] HashiCorp Inc, San Francisco, CA 94105 USA
DOI
10.1109/DSN-W.2018.00017
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
SWIM is a peer-to-peer group membership protocol, with attractive scaling and robustness properties. However, our experience supporting an implementation of SWIM shows that a high rate of false positive failure detections (healthy members being marked as failed) is possible in certain real world scenarios, and that this is due to SWIM's sensitivity to slow message processing. To address this we propose a set of extensions to SWIM (together called Lifeguard), which employ heuristic measures of a failure detector's local health. In controlled tests, Lifeguard is able to reduce the false positive rate by more than 50x. Real world deployment of the extensions has significantly reduced support requests and observed instability. The need for this work points to the fail-stop failure model being overly simplistic for large datacenters, where the likelihood of some nodes experiencing transient CPU starvation, IO flakiness, random packet loss, or other non-crash problems becomes high. With increasing attention being given to these gray failures, we believe the local health abstraction may be applicable in a broad range of settings, including other kinds of distributed failure detectors.
Pages: 22-25
Page count: 4