Fault-Aware Prediction-Guided Page Offlining for Uncorrectable Memory Error Prevention

Cited by: 8

Authors
Du, Xiaoming [1]
Li, Cong [1]
Zhou, Shen [1]
Liu, Xian [2]
Xu, Xiaohan [2]
Wang, Tianjiao [2]
Ge, Shijian [3]
Affiliations
[1] Intel Corp, Shanghai, People's Republic of China
[2] ByteDance, Beijing, People's Republic of China
[3] ByteDance, Shanghai, People's Republic of China
Keywords
memory reliability; uncorrectable error prevention; page offlining; row fault identification; uncorrectable error prediction
DOI
10.1109/ICCD53106.2021.00077
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Uncorrectable memory errors are the major cause of hardware failures in datacenters, leading to server crashes. Page offlining is an error-prevention mechanism implemented in modern operating systems. Traditional offlining policies are based on the correctable error (CE) rate of a page over a past period. However, CEs are only observations; the underlying causes are memory circuit faults. A single fault, such as a row fault, can impact many pages, and not all faults are equally prone to uncorrectable errors (UEs). In this paper, we propose a fault-aware, prediction-guided policy for page offlining. In the proposed policy, we first identify row faults from CE observations as the preliminary candidates for offlining. Leveraging knowledge of the error correction code, we design a predictor based on error-bit patterns to predict whether a row fault is prone to UEs. Pages impacted by the UE-prone rows are then offlined. Empirical evaluation using the error log from a modern large-scale cluster at ByteDance demonstrates that the proposed policy avoids several times more UEs than the traditional policy at a comparable cost in memory capacity lost to page offlining.
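The three steps named in the abstract (identify row faults from CEs, predict which row faults are UE-prone from error-bit patterns, offline the affected pages) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the row-fault criterion or the predictor, so the record layout (`CorrectableError`), both thresholds, and the rule inside `is_ue_prone` are hypothetical placeholders chosen only to make the control flow concrete.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    """Hypothetical CE log record; field names are assumptions."""
    bank: int
    row: int
    column: int
    page: int              # physical page frame containing the error
    error_bits: frozenset  # positions of the flipped bits in the ECC word


# Hypothetical thresholds; the abstract does not state the real criteria.
MIN_COLUMNS_FOR_ROW_FAULT = 2
MAX_SAFE_ERROR_BITS = 1


def identify_row_faults(errors):
    """Group CEs by (bank, row); flag rows with CEs in several columns."""
    by_row = defaultdict(list)
    for e in errors:
        by_row[(e.bank, e.row)].append(e)
    return {
        row: ces
        for row, ces in by_row.items()
        if len({ce.column for ce in ces}) >= MIN_COLUMNS_FOR_ROW_FAULT
    }


def is_ue_prone(ces):
    """Placeholder predictor: a row is UE-prone if its CEs touch more
    distinct bit positions than the assumed safe limit."""
    seen_bits = set()
    for ce in ces:
        seen_bits |= ce.error_bits
    return len(seen_bits) > MAX_SAFE_ERROR_BITS


def pages_to_offline(errors):
    """Return pages on UE-prone faulty rows.

    Simplification: only pages with an observed CE are returned, whereas
    the abstract offlines all pages impacted by the row.
    """
    pages = set()
    for ces in identify_row_faults(errors).values():
        if is_ue_prone(ces):
            pages.update(ce.page for ce in ces)
    return pages


if __name__ == "__main__":
    log = [
        CorrectableError(0, 5, 1, 100, frozenset({3})),
        CorrectableError(0, 5, 9, 101, frozenset({7})),
        CorrectableError(0, 6, 2, 200, frozenset({1})),
    ]
    print(sorted(pages_to_offline(log)))
```

The contrast with a traditional policy is that the decision is made per faulty row rather than per page CE count, so a page with few CEs can still be selected when it sits on a row judged UE-prone.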
Pages: 456-463
Page count: 8