Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 1
Authors
Belayneh, Leul [1 ]
Ye, Haojie [1 ]
Chen, Kuan-Yu [1 ]
Blaauw, David [1 ]
Mudge, Trevor [1 ]
Dreslinski, Ronald [1 ]
Talati, Nishil [1 ]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
Keywords
GPGPU; multi-GPU; data movement; GPU cache management; CACHE;
DOI
10.1145/3559009.3569649
CLC Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812
Abstract
With generational gains from transistor scaling, GPUs have been able to accelerate traditional computation-intensive workloads. With the end of Moore's Law, however, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly coupled multi-GPU systems. However, the Non-Uniform Memory Access (NUMA) bottleneck prevents multi-GPU systems from efficiently utilizing their compute resources. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations catered to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that the GPU's conventional cache subsystem does not exploit well. For cache-insensitive workloads, DualOpt transfers remote data at a fine granularity instead of the conventional cache-line granularity, and coalesces these transfers to utilize inter-GPU bandwidth efficiently. For cache-friendly workloads, DualOpt adds a remote-only cache that exploits the locality in remote accesses. Finally, a decision engine automatically identifies the class of a workload and applies the corresponding optimization, improving overall performance by 2.5x on a 4-GPU system with a small hardware overhead of 0.032%.
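
The abstract describes a decision engine that profiles the spatio-temporal locality of remote accesses and then selects between the two optimizations. The sketch below is a minimal software analogue of that classification step, written only to make the idea concrete: it scores a window of remote cache-line accesses for reuse and adjacency, then reports whether the fine-grained coalesced-transfer path or the remote-only-cache path would be chosen. The class names, locality proxies, window size, and 0.3 threshold are illustrative assumptions, not details from the paper, which implements the decision engine in hardware.

```cpp
// Illustrative sketch only: all names, thresholds, and counters below are
// assumptions for exposition; DualOpt's actual decision engine is hardware.
#include <cstdint>
#include <iostream>
#include <unordered_map>

enum class Mode { Undecided, CacheFriendly, CacheInsensitive };

// Scores the spatio-temporal locality of remote (inter-GPU) accesses over a
// profiling window and then picks one of the two optimization paths.
class DecisionEngine {
 public:
  // Record one remote access, identified by its cache-line address.
  void observe(std::uint64_t line_addr) {
    ++total_;
    if (seen_.count(line_addr)) {
      ++reuses_;       // temporal locality: this exact line was seen before
    } else if (seen_.count(line_addr - 1) || seen_.count(line_addr + 1)) {
      ++neighbors_;    // spatial locality: an adjacent line was seen before
    }
    seen_[line_addr] = total_;
  }

  // Classify the workload once enough accesses have been profiled.
  Mode decide() const {
    if (total_ < kWindow) return Mode::Undecided;  // too few samples so far
    double locality = static_cast<double>(reuses_ + neighbors_) /
                      static_cast<double>(total_);
    // 0.3 is a placeholder threshold, not the paper's actual criterion.
    return locality >= 0.3 ? Mode::CacheFriendly : Mode::CacheInsensitive;
  }

 private:
  static constexpr std::uint64_t kWindow = 1024;           // assumed window size
  std::unordered_map<std::uint64_t, std::uint64_t> seen_;  // lines seen so far
  std::uint64_t total_ = 0, reuses_ = 0, neighbors_ = 0;
};

int main() {
  DecisionEngine engine;
  // A streaming pattern: every remote cache line is touched exactly once.
  for (std::uint64_t i = 0; i < 2048; ++i) engine.observe(i * 4);
  std::cout << (engine.decide() == Mode::CacheInsensitive
                    ? "cache-insensitive: use fine-grained, coalesced transfers\n"
                    : "cache-friendly: add a remote-only cache\n");
  return 0;
}
```

In this example, a streaming pattern that touches every remote line exactly once scores near-zero locality and is classified as cache-insensitive, the case in which DualOpt would switch to fine-granularity, coalesced transfers.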
Pages: 304-316
Number of pages: 13