Detecting SDCs in GPGPUs Through Efficient Partial Thread Redundancy

Cited by: 0
Authors
Wei, Xiaohui [1]
Wu, Yan [1]
Jiang, Nan [1]
Yue, Hengshan [1]
Affiliation
[1] Jilin University, College of Computer Science and Technology, Changchun, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
GPGPUs; Soft Error; Silent Data Corruptions (SDCs); Partial Thread Protection
DOI
10.1007/978-981-97-0862-8_14
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
As General-Purpose Graphics Processing Units (GPGPUs) are widely employed in various precision-sensitive and safety-critical domains, guaranteeing the execution reliability of such applications under the impact of soft errors becomes a critical issue. Redundant Multi-Threading (RMT) provides a potentially low-cost mechanism for improving GPGPU reliability, but full protection comes with high time and resource costs. In this paper, we propose a partial thread protection mechanism for efficient Silent Data Corruption (SDC) detection in GPGPU programs. Firstly, we establish an accurate and efficient model for assessing the thread SDC vulnerability by capturing intra-thread error propagation and inter-thread error propagation. Then, based on the analysis results, we selectively replicate the SDC vulnerable threads. Experimental results indicate that our proposed thread SDC vulnerability assessment model closely aligns with the fault injection results, while introducing much lower execution overhead. Our partial thread redundancy mechanism provides a better trade-off between reliability and overhead compared with full RMT.
Pages: 224-239
Page count: 16