DRAM Failure Prediction in Large-Scale Data Centers

Cited by: 11
Authors
Yu, Fengyuan [1]
Xu, Hongzuo [2]
Jian, Songlei [1]
Huang, Chenlin [1]
Wang, Yijie [2]
Wu, Zhiyue [2]
Affiliations
[1] Natl Univ Def Technol, Coll Comp, Changsha, Peoples R China
[2] Sci & Technol Parallel & Distributed Proc Lab, Changsha, Peoples R China
Funding
National Natural Science Foundation of China; Science Foundation of the Ministry of Education of China
Keywords
DRAM error; Tree-based Model; Feature Engineering;
DOI
10.1109/JCC53141.2021.00012
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cloud computing is developing rapidly, and data centers are important infrastructure for cloud services and the JointCloud structure. DRAM failure is one of the main causes of node outages in data centers. This paper proposes a decision-tree-based DRAM failure prediction method for large-scale cloud data centers. We use the first publicly available DRAM failure prediction dataset, released for the PAKDD 2021 AIOps competition. We construct a suite of handcrafted features from the system kernel log data and the MCA log data. The feature engineering is described in detail, which may inspire and foster future research in this field. Using a state-of-the-art classifier (XGBoost), our method predicts DRAM failures effectively and in a timely manner. Our solution performs well on the PAKDD 2021 dataset, generally achieving more than 60% precision in the validation phase. Extensive experiments on variants of our method validate the significance of the different strategies in the proposed solution.
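The abstract does not list the individual features, so the following is only a minimal sketch of one common kind of handcrafted log feature: per-server event counts over look-back windows, together with the precision metric mentioned above. The record layout `(server_id, timestamp, event_type)`, the event names, the window lengths, and all function names are assumptions made for illustration; they are not taken from the paper or from the PAKDD 2021 dataset.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(events):
    """Group event timestamps by (server, event type), sorted for bisecting.

    `events` is an iterable of hypothetical (server_id, unix_seconds, event_type) tuples.
    """
    index = defaultdict(list)
    for server, ts, kind in events:
        index[(server, kind)].append(ts)
    for stamps in index.values():
        stamps.sort()
    return index


def window_counts(index, server, now, kinds, windows):
    """Count events of each kind in the look-back windows (now - w, now]."""
    features = {}
    for kind in kinds:
        stamps = index.get((server, kind), [])
        upper = bisect_right(stamps, now)  # number of events at or before `now`
        for w in windows:
            lower = bisect_right(stamps, now - w)  # events at or before window start
            features[f"{kind}_last_{w}s"] = upper - lower
    return features


def precision(y_true, y_pred):
    """Fraction of predicted failures that were real failures (0.0 if none predicted)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


# Placeholder events; the event type names are invented for this example.
events = [
    ("s1", 100, "mca_read_error"),
    ("s1", 160, "mca_read_error"),
    ("s1", 170, "kernel_ce"),
    ("s2", 165, "mca_read_error"),
]
index = build_index(events)
feats = window_counts(index, "s1", 180, ["mca_read_error", "kernel_ce"], [30, 120])
```

Feature dictionaries of this shape, one per server and prediction time, could then be stacked into a matrix and passed to a tree-based classifier such as XGBoost; that step is omitted here to keep the sketch dependency-free.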
Pages: 1-8
Number of pages: 8
Related papers
50 in total
  • [1] On Workload-Aware DRAM Failure Prediction in Large-Scale Data Centers
    Wang, Xingyi
    Li, Yu
    Chen, Yiquan
    Wang, Shiwen
    Du, Yin
    He, Cheng
    Zhang, YuZhong
    Chen, Pinan
    Li, Xin
    Song, Wenjun
    Xu, Qiang
    Jiang, Li
    [J]. 2021 IEEE 39TH VLSI TEST SYMPOSIUM (VTS), 2021,
  • [2] Network Virtualization for Large-Scale Data Centers
    Ando, Tatsuhiro
    Shimokuni, Osamu
    Asano, Katsuhito
    [J]. FUJITSU SCIENTIFIC & TECHNICAL JOURNAL, 2013, 49 (03): : 292 - 299
  • [3] An Improved LSTM-Based Prediction Approach for Resources and Workload in Large-Scale Data Centers
    Yuan, Haitao
    Bi, Jing
    Li, Shuang
    Zhang, Jia
    Zhou, MengChu
    [J]. IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (12): : 22816 - 22829
  • [4] A survey on failure prediction of large-scale server clusters
    Xue, Zhenghua
    Dong, Xiaoshe
    Ma, Siyuan
    Dong, Weiqing
    [J]. SNPD 2007: EIGHTH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING, AND PARALLEL/DISTRIBUTED COMPUTING, VOL 2, PROCEEDINGS, 2007, : 733 - +
  • [5] HHS: an efficient network topology for large-scale data centers
    Sadoon Azizi
    Naser Hashemi
    Ahmad Khonsari
    [J]. The Journal of Supercomputing, 2016, 72 : 874 - 899
  • [6] OpenPOWER: Reengineering a server ecosystem for large-scale data centers
    Gschwind, Michael
    [J]. 2014 IEEE HOT CHIPS 26 SYMPOSIUM (HCS), 2014,
  • [7] Large-Scale Modeling of Critical Telecommunications Facilities and Data Centers
    Bodi, Frank
    [J]. INTELEC 08 - 30TH INTERNATIONAL TELECOMMUNICATIONS ENERGY, VOLS 1 AND 2, 2008, : 229 - 236
  • [8] HHS: an efficient network topology for large-scale data centers
    Azizi, Sadoon
    Hashemi, Naser
    Khonsari, Ahmad
    [J]. JOURNAL OF SUPERCOMPUTING, 2016, 72 (03): : 874 - 899
  • [9] Proactive Data Migration for Improved Storage Availability in Large-Scale Data Centers
    Wu, Suzhen
    Jiang, Hong
    Mao, Bo
    [J]. IEEE TRANSACTIONS ON COMPUTERS, 2015, 64 (09) : 2637 - 2651
  • [10] DRAM Errors in the Wild: A Large-Scale Field Study
    Schroeder, Bianca
    Pinheiro, Eduardo
    Weber, Wolf-Dietrich
    [J]. COMMUNICATIONS OF THE ACM, 2011, 54 (02) : 100 - 107