On Workload-Aware DRAM Failure Prediction in Large-Scale Data Centers

Cited by: 4
Authors
Wang, Xingyi [1]
Li, Yu [2]
Chen, Yiquan [3]
Wang, Shiwen [3]
Du, Yin [3]
He, Cheng [3]
Zhang, YuZhong [3]
Chen, Pinan [3]
Li, Xin [3]
Song, Wenjun [3]
Xu, Qiang [2]
Jiang, Li [1, 4, 5]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Alibaba Grp, Hangzhou, Peoples R China
[4] Shanghai Qi Zhi Inst, Shanghai, Peoples R China
[5] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DRAM failure prediction; Xperf metrics; HiDEC;
DOI
10.1109/VTS50974.2021.9441059
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
DRAM failures are one of the major hardware threats to the reliability of large-scale data centers, since uncorrectable errors in DRAMs may cause servers to shut down. Existing work addresses this problem by predicting DRAM failures in advance with machine learning models, and generally treats correctable errors (CEs) as the most important feature. The major reason behind the emergence of CEs is the accumulated stress caused by intensive workloads. Moreover, defective DRAMs do not manifest themselves as system errors until the defective cells are accessed by specific workloads. The workloads running on a server are therefore also important for DRAM failure prediction. In this paper, we focus on the impact of workloads on DRAM failures. We design workload features from both macroscopic and microscopic aspects, i.e., node-level performance metrics and cell-level DRAM access patterns, respectively. Furthermore, we propose the Hierarchical DRAM Error Code (HiDEC) to represent the DRAM access pattern. We apply several decision-tree-based models to DRAM failure prediction to highlight the generality of the designed features. Experiments are carried out on a dataset collected from a real-world commercial data center. The results show that both macroscopic and microscopic features bring significant improvements to prediction performance.
Pages: 6