Evaluation of distributed recovery in large-scale storage systems

Cited by: 41
Authors
Xin, Q [1]
Miller, EL [1]
Schwarz, TJE [1]
Affiliations
[1] Univ Calif Santa Cruz, Storage Syst Res Ctr, Santa Cruz, CA 95064 USA
Keywords
DOI
10.1109/HPDC.2004.1323523
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Storage clusters consisting of thousands of disk drives are now being used both for their large capacity and high throughput. However, their reliability is far worse than that of smaller storage systems due to the increased number of storage nodes. RAID technology is no longer sufficient to guarantee the necessary high data reliability for such systems, because disk rebuild time lengthens as disk capacity grows. In this paper, we present FAst Recovery Mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time. FARM works in concert with replication and erasure-coding redundancy schemes to dramatically lower the probability of data loss in large-scale storage systems. We have examined essential factors that influence system reliability, performance, and costs, such as failure detections, disk bandwidth usage for recovery, disk space utilization, disk drive replacement, and system scales, by simulating system behavior under disk failures. Our results show the reliability improvement from FARM and demonstrate the impacts of various factors on system reliability. Using our techniques, system designers will be better able to build multi-petabyte storage systems with much higher reliability at lower cost than previously possible.
Pages: 172 - 181
Page count: 10
Related papers
50 in total
  • [1] Independent recovery in large-scale distributed systems
    Triantafillou, P
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 1996, 22 (11) : 812 - 826
  • [2] Key Management for Large-Scale Distributed Storage Systems
    Lim, Hoon Wei
    [J]. PUBLIC KEY INFRASTRUCTURES, SERVICES AND APPLICATIONS, 2010, 6391 : 99 - 113
  • [3] A Data Storage Approach for Large-Scale Distributed Medical Systems
    de Macedo, Douglas D. J.
    von Wangenheim, Aldo
    Dantas, Mario A. R.
    [J]. 2015 9TH INTERNATIONAL CONFERENCE ON COMPLEX, INTELLIGENT, AND SOFTWARE INTENSIVE SYSTEMS CISIS 2015, 2015, : 486 - 490
  • [4] Approximate Reliability Evaluation of Large-Scale Distributed Systems
    Mo, Yuchang
    Han, Jianmin
    Zhang, Zhizheng
    Pan, Zhusheng
    Zhong, Farong
    [J]. JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2014, 30 (01) : 25 - 41
  • [5] An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems
    Gao, Yu
    Dou, Wensheng
    Qin, Feng
    Gao, Chushu
    Wang, Dong
    Wei, Jun
    Huang, Ruirui
    Zhou, Li
    Wu, Yongming
    [J]. ESEC/FSE'18: PROCEEDINGS OF THE 2018 26TH ACM JOINT MEETING ON EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, 2018, : 539 - 550
  • [6] Storage optimization for large-scale distributed stream-processing systems
    Hildrum, Kirsten
    Douglis, Fred
    Wolf, Joel L.
    Yu, Philip S.
    Fleischer, Lisa
    Katta, Akshay
    [J]. ACM Transactions on Storage, 2008, 3 (04)
  • [7] Large-Scale Distributed Graph Computing Systems: An Experimental Evaluation
    Lu, Yi
    Cheng, James
    Yan, Da
    Wu, Huanhuan
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 8 (03) : 281 - 292
  • [8] On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems
    Zhu, Yunfeng
    Lee, Patrick P. C.
    Xu, Yinlong
    Hu, Yuchong
    Xiang, Liping
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2014, 25 (07) : 1830 - 1840
  • [9] Fault Tolerance Performance Evaluation of Large-Scale Distributed Storage Systems HDFS and Ceph Case Study
    Arafa, Yehia
    Barai, Atanu
    Zheng, Mai
    Badawy, Abdel-Hameed A.
    [J]. 2018 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2018,
  • [10] Design and performance evaluation of fail-over and recovery strategies for large-scale multimedia storage systems
    Zeng, Zeng
    Veeravalli, Bharadwaj
    Srivastava, Jaideep
    [J]. ICON: 2006 IEEE INTERNATIONAL CONFERENCE ON NETWORKS, VOLS 1 AND 2, PROCEEDINGS: NETWORKING -CHALLENGES AND FRONTIERS, 2006, : 9 - +