Data-centric Reliability Management in GPUs

Cited by: 1
Authors
Kadam, Gurunath [1]
Smirni, Evgenia [1]
Jog, Adwait [1]
Affiliations
[1] William & Mary, Dept Comp Sci, Williamsburg, VA 23185 USA
Funding
U.S. National Science Foundation
Keywords
GPUs; Reliability; Multi-bit Faults; Application Resilience; MEMORY; ERROR; RESILIENCE;
DOI
10.1109/DSN48987.2021.00040
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Graphics Processing Units (GPUs) have become the default choice of acceleration in a wide range of application domains. To keep up with computational demands, the GPU memory system is constantly being innovated from both the cache and DRAM perspectives. Such innovations can adversely affect GPU reliability and in fact, can lead to an increase in the number of multi-bit faults. To address this problem, we systematically study a wide range of GPGPU applications and find that usually, only a small percentage of data needs protection to increase application resilience. This data is highly accessed and shared (constitutes hot memory), which implies that faults in this space can often lead to incorrect application output. An in-depth analysis of application code shows that information of such data can be passed on to the hardware to guide low-overhead detection/correction schemes. In this vein, we developed low-overhead partial data replication schemes that exploit latency tolerance in GPUs. Overall, this data-centric approach dramatically improves GPGPU application resilience, with a minimal additional average performance overhead of 1.2% for detection-only and 3.4% for detection-and-correction.
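The abstract does not spell out how the replication schemes work, so the following is only a minimal sketch of the general idea of partial data replication, not the authors' design. It assumes that hot data is selected by access count, that two copies are kept for detection-only, and that three copies with a bitwise majority vote are kept for detection-and-correction. The names `pick_hot` and `PartiallyReplicatedMemory` are hypothetical and exist only for illustration.

```python
from collections import Counter


def pick_hot(access_counts, fraction):
    """Return the most-accessed addresses, covering `fraction` of all addresses."""
    k = max(1, int(len(access_counts) * fraction))
    return {addr for addr, _ in Counter(access_counts).most_common(k)}


class PartiallyReplicatedMemory:
    """Toy memory that keeps extra copies only for hot addresses (assumption)."""

    def __init__(self, data, hot, copies):
        if copies not in (2, 3):
            raise ValueError("copies must be 2 (detect) or 3 (detect and correct)")
        self.data = dict(data)
        self.hot = set(hot)
        self.copies = copies
        # Replicas exist only for hot addresses; cold data stays unprotected.
        self.replicas = {a: [self.data[a]] * (copies - 1) for a in self.hot}

    def read(self, addr):
        value = self.data[addr]
        if addr not in self.hot:
            return value  # unprotected: a fault here goes unnoticed
        if self.copies == 2:
            # Detection-only: any mismatch between the two copies is reported.
            if value != self.replicas[addr][0]:
                raise RuntimeError(f"fault detected at address {addr}")
            return value
        # Detection-and-correction: bitwise majority vote over three copies.
        a, (b, c) = value, self.replicas[addr]
        return (a & b) | (a & c) | (b & c)


access_counts = {0: 100, 1: 2, 2: 1, 3: 90}
hot = pick_hot(access_counts, 0.5)
mem = PartiallyReplicatedMemory({0: 0b1010, 1: 7, 2: 9, 3: 5}, hot, copies=3)
mem.data[0] ^= 0b0110  # inject a multi-bit fault into the primary copy
corrected = mem.read(0)
```

In this sketch a multi-bit fault in one copy of a hot word is masked by the vote, while the same fault in cold data would pass through silently. That is the trade-off the abstract describes: protecting only the small, highly accessed portion of the data.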
Pages: 271-283
Page count: 13