Evaluation of distributed recovery in large-scale storage systems

Cited by: 41
Authors
Xin, Q [1]
Miller, EL [1]
Schwarz, TJE [1]
Affiliation
[1] Univ Calif Santa Cruz, Storage Syst Res Ctr, Santa Cruz, CA 95064 USA
DOI
10.1109/HPDC.2004.1323523
Chinese Library Classification number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
Storage clusters consisting of thousands of disk drives are now being used both for their large capacity and high throughput. However, their reliability is far worse than that of smaller storage systems due to the increased number of storage nodes. RAID technology is no longer sufficient to guarantee the necessary high data reliability for such systems, because disk rebuild time lengthens as disk capacity grows. In this paper, we present FAst Recovery Mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time. FARM works in concert with replication and erasure-coding redundancy schemes to dramatically lower the probability of data loss in large-scale storage systems. We have examined essential factors that influence system reliability, performance, and costs, such as failure detections, disk bandwidth usage for recovery, disk space utilization, disk drive replacement, and system scales, by simulating system behavior under disk failures. Our results show the reliability improvement from FARM and demonstrate the impacts of various factors on system reliability. Using our techniques, system designers will be better able to build multi-petabyte storage systems with much higher reliability at lower cost than previously possible.
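The abstract's argument that rebuild time grows with disk capacity, and that spreading recovery across many disks shortens it, can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's simulator or the FARM algorithm: the function name `rebuild_hours`, the idealized model (time = capacity / aggregate recovery bandwidth), and all numeric values are assumptions chosen only for illustration.

```python
def rebuild_hours(capacity_gb: float, bandwidth_mb_s: float,
                  parallel_targets: int = 1) -> float:
    """Idealized time, in hours, to re-create one failed disk's data.

    Assumption (not from the paper): recovery proceeds at a constant
    per-disk bandwidth and scales perfectly across `parallel_targets`
    disks that each receive an equal share of the lost data.
    """
    if capacity_gb <= 0 or bandwidth_mb_s <= 0 or parallel_targets < 1:
        raise ValueError("capacity and bandwidth must be positive, targets >= 1")
    total_mb = capacity_gb * 1000  # decimal units: 1 GB = 1000 MB
    seconds = total_mb / (bandwidth_mb_s * parallel_targets)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical values: recovery throttled to 10 MB/s per disk.
    for capacity in (250, 500, 1000):
        single = rebuild_hours(capacity, 10)
        spread = rebuild_hours(capacity, 10, parallel_targets=100)
        print(f"{capacity} GB: one spare disk {single:.1f} h, "
              f"100 target disks {spread:.2f} h")
```

Under these assumptions, doubling the capacity doubles the rebuild time onto a single spare disk, while distributing the same data over many target disks divides it by the number of targets. Real systems would fall short of this ideal because of failure-detection latency and contention for disk bandwidth, which are among the factors the abstract says the paper examines.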
Pages: 172-181
Number of pages: 10