Evaluating the Viability of Application-Driven Cooperative CPU/GPU Fault Detection

Cited by: 0
Authors
Li, Dong [1]
Lee, Seyong [1]
Vetter, Jeffrey S. [1]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
Keywords
fault detection; heterogeneous computing
DOI
Not available
Chinese Library Classification number
TP301 [Theory and methods]
Subject classification code
081202
Abstract
Trends in high-performance computing are increasing the heterogeneity of the computational resources within a single machine. Heterogeneous CPU/GPU platforms, however, exacerbate the resilience problems faced by current large-scale systems. Designing efficient resilience strategies is therefore critical to the wider adoption of heterogeneous platforms in future exascale systems. Conventional resilience strategies for GPUs incur significant performance and power overhead because they employ a one-size-fits-all approach that enforces uniform data protection. In addition, isolating CPU protection from GPU protection forfeits optimization opportunities that heterogeneous CPU/GPU platforms provide. In this paper, we explore the viability of an application-driven, cooperative CPU/GPU method for detecting faults that occur in GPU global memory. By selectively protecting application-critical data and leveraging time and space redundancy on the CPU to detect faults, we incur only 2.2% performance overhead while capturing more than 90% of the errors that cause incorrect application results.
Pages: 670-679
Page count: 10