Reliability and Failure Impact Analysis of Distributed Storage Systems with Dynamic Refuging

Cited by: 0

Authors:

Akutsu, Hiroaki [1]

Ueda, Kazunori [2]

Chiba, Takeru [1]

Kawaguchi, Tomohiro [1]

Shimozono, Norio [1]

Affiliations:

[1] Hitachi Ltd, Res & Dev Grp, Yokohama, Kanagawa 2440817, Japan

[2] Waseda Univ, Tokyo 1698555, Japan

Source:

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2016 / Vol. E99D / No. 09

Keywords:

erasure coding; highly redundant storage systems; reliability; rebuild; Monte Carlo simulation; RAID

DOI:

10.1587/transinf.2016EDP7139

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

In recent data centers, large-scale storage systems storing big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems using more than a thousand low-cost large-capacity drives. Some large-scale storage systems protect data by erasure coding to prevent data loss. As the redundancy level of erasure coding increases, the probability of data loss decreases, but the overhead of normal write operations and the additional storage required for coding both grow. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two concerns regarding reliability in large-scale storage systems: (i) as the number of drives increases, systems become more susceptible to multiple drive failures, and (ii) distributing stripes among many drives can shorten the rebuild time but increases the risk of data loss due to multiple drive failures. If multiple drive failures cause data loss, many users of the storage system are affected. These concerns were not addressed in prior quantitative reliability studies based on realistic settings. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method which we call Dynamic Refuging. Dynamic Refuging rebuilds failed blocks starting from those with the lowest redundancy and strategically selects blocks to read for repairing lost data. We modeled the dynamic change in the amount of storage at each redundancy level caused by multiple drive failures, and performed reliability analysis with Monte Carlo simulation using realistic drive failure characteristics. We present a failure impact model and a method for localizing the failure. When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, the proposed technique scaled well, and the probability of data loss decreased by two orders of magnitude for systems with a thousand drives compared to normal RAID. An appropriate setting of the stripe distribution level could localize the failure.
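The rebuild priority described in the abstract (repair blocks with the lowest remaining redundancy first) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rebuild_order`, the stripe layout, and the uniform `redundancy_level` are assumptions made for illustration, and the strategic selection of blocks to read is not modeled.

```python
def rebuild_order(stripes, failed_drives, redundancy_level):
    """Return (rebuild_queue, lost_stripes) after a set of drive failures.

    stripes          -- dict: stripe id -> set of drive ids holding its blocks
    failed_drives    -- set of failed drive ids
    redundancy_level -- number of block losses a stripe tolerates
                        (assumed identical for every stripe)
    """
    queue, lost = [], []
    for stripe_id, drives in stripes.items():
        missing = len(drives & failed_drives)
        if missing == 0:
            continue  # stripe is intact, nothing to rebuild
        remaining = redundancy_level - missing
        if remaining < 0:
            lost.append(stripe_id)  # more losses than the code tolerates
        else:
            queue.append((remaining, stripe_id))
    queue.sort()  # lowest remaining redundancy is rebuilt first
    return queue, sorted(lost)


# Hypothetical layout: four stripes spread over ten drives.
stripes = {
    0: {0, 1, 2, 3},
    1: {1, 2, 4, 5},
    2: {6, 7, 8, 9},
    3: {0, 4, 6, 8},
}
queue, lost = rebuild_order(stripes, failed_drives={0, 1, 2}, redundancy_level=3)
print(queue)  # [(0, 0), (1, 1), (2, 3)]
print(lost)   # []
```

Each queue entry is `(remaining_redundancy, stripe_id)`, so stripe 0, which cannot survive another failure, is repaired before stripes 1 and 3.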

Citation

Pages: 2259-2268

Page count: 10
