Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study

Authors
Yu, Qiao [1, 2]
Zhang, Wengui [3]
Cardoso, Jorge [1, 4]
Kao, Odej [2]
Affiliations
[1] Huawei Munich Res Ctr, Munich, Germany
[2] Tech Univ Berlin, Berlin, Germany
[3] Huawei Technol Co Ltd, Shenzhen, Peoples R China
[4] Univ Coimbra, CISUC, Coimbra, Portugal
Keywords
Memory; Failure prediction; AIOps; Uncorrectable error; Reliability; Machine Learning
DOI
10.1109/ICCAD57390.2023.10323692
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In large-scale datacenters, memory failure is a common cause of server crashes, with uncorrectable errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using correctable errors (CEs), without fully considering the information provided by error bits. However, error bit patterns correlate strongly with the occurrence of UEs. In this paper, we present a comprehensive study of the correlation between CEs and UEs, emphasizing the importance of spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Through evaluations on real-world datasets, we demonstrate that our approach improves prediction performance by 15% in F1-score compared to state-of-the-art algorithms. Overall, our approach reduces the number of virtual machine interruptions caused by UEs by approximately 59%.
Pages: 9