Reliability and Failure Impact Analysis of Distributed Storage Systems with Dynamic Refuging

Authors
Akutsu, Hiroaki [1]
Ueda, Kazunori [2]
Chiba, Takeru [1]
Kawaguchi, Tomohiro [1]
Shimozono, Norio [1]
Affiliations
[1] Hitachi Ltd, Res & Dev Grp, Yokohama, Kanagawa 2440817, Japan
[2] Waseda Univ, Tokyo 1698555, Japan
Keywords
erasure coding; highly redundant storage systems; reliability; rebuild; Monte Carlo simulation; RAID
DOI
10.1587/transinf.2016EDP7139
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In recent data centers, large-scale storage systems that store big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems from more than a thousand low-cost, large-capacity drives. Some large-scale storage systems use erasure coding to prevent data loss. Raising the redundancy level of the erasure code lowers the probability of data loss, but it also increases the overhead of normal write operations and the additional storage required for coding. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two reliability concerns in large-scale storage systems: (i) as the number of drives increases, systems become more susceptible to multiple drive failures, and (ii) distributing stripes among many drives can shorten the rebuild time but increases the risk of data loss due to multiple drive failures. When data loss is caused by multiple drive failures, it affects many users of the storage system. Prior quantitative reliability studies based on realistic settings did not address these concerns. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method that we call Dynamic Refuging. Dynamic Refuging rebuilds failed blocks starting from those with the lowest redundancy and strategically selects which blocks to read when repairing lost data. We modeled how the amount of storage at each redundancy level changes dynamically under multiple drive failures, and performed a reliability analysis by Monte Carlo simulation using realistic drive failure characteristics. We also presented a failure impact model and a method for localizing the failure.
When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, the proposed technique scaled well, and the probability of data loss decreased by two orders of magnitude for systems with a thousand drives compared to normal RAID. An appropriate setting of the stripe distribution level could localize the failure.
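The rebuild ordering described in the abstract (repair the blocks with the lowest remaining redundancy first) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Stripe` type, the function names, and the example layout are all hypothetical, and the sketch assumes one block per drive per stripe. It does not model the selection of which surviving blocks to read, nor the Monte Carlo failure model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stripe:
    # Hypothetical representation, assumed for illustration only.
    stripe_id: int
    drives: frozenset   # drives holding this stripe's blocks (one block per drive)
    redundancy: int     # number of block losses the erasure code tolerates


def remaining_redundancy(stripe, failed):
    """Further block losses the stripe can survive; negative means data loss."""
    return stripe.redundancy - len(stripe.drives & failed)


def rebuild_order(stripes, failed):
    """Return (rebuild queue, lost stripes); the queue is lowest redundancy first."""
    failed = frozenset(failed)
    damaged = [s for s in stripes if s.drives & failed]
    lost = [s for s in damaged if remaining_redundancy(s, failed) < 0]
    queue = sorted(
        (s for s in damaged if remaining_redundancy(s, failed) >= 0),
        # Ties are broken by stripe id so the order is deterministic.
        key=lambda s: (remaining_redundancy(s, failed), s.stripe_id),
    )
    return queue, lost


# Four stripes with redundancy level 3, spread over ten drives (0..9).
stripes = [
    Stripe(0, frozenset({0, 1, 2, 3, 4, 5}), 3),
    Stripe(1, frozenset({2, 3, 4, 5, 6, 7}), 3),
    Stripe(2, frozenset({0, 2, 4, 6, 8, 9}), 3),
    Stripe(3, frozenset({1, 3, 5, 7, 8, 9}), 3),
]

queue, lost = rebuild_order(stripes, failed={0, 2, 4})
print([s.stripe_id for s in queue])  # stripes with no redundancy left come first
print([s.stripe_id for s in lost])   # stripes that can no longer be recovered
```

With drives 0, 2 and 4 failed, stripes 0 and 2 have no redundancy left and are queued ahead of stripe 1, which can still tolerate one more loss; stripe 3 is untouched and is not queued. A fourth failure on drive 6 would push stripe 2 past its redundancy level, which is the data-loss event that rebuilding the most exposed stripes first is meant to make less likely.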
Pages: 2259-2268
Number of pages: 10