Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 1
Authors
Belayneh, Leul [1 ]
Ye, Haojie [1 ]
Chen, Kuan-Yu [1 ]
Blaauw, David [1 ]
Mudge, Trevor [1 ]
Dreslinski, Ronald [1 ]
Talati, Nishil [1 ]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
Keywords
GPGPU; multi-GPU; data movement; GPU cache management; CACHE
DOI
10.1145/3559009.3569649
CLC number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
With generational gains from transistor scaling, GPUs have been able to accelerate traditional computation-intensive workloads. With the end of Moore's Law, however, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior work has proposed tightly-coupled multi-GPU systems. However, multi-GPU systems struggle to utilize their compute resources efficiently because of the Non-Uniform Memory Access (NUMA) bottleneck. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations tailored to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as either cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that the GPU's conventional cache subsystem does not exploit well. For cache-insensitive workloads, DualOpt transfers remote data at fine granularity instead of in full cache lines, and coalesces these transfers to use inter-GPU bandwidth efficiently. For cache-friendly workloads, DualOpt adds a remote-only cache that can exploit locality in remote accesses. Finally, a decision engine automatically identifies a workload's class and applies the corresponding optimization, improving overall performance by 2.5x on a 4-GPU system with a small hardware overhead of 0.032%.
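To make the classification step concrete, the following is a minimal software sketch, written purely for illustration, of how a DualOpt-style decision engine might label a workload from per-epoch remote-access statistics and select an optimization. The counter names, the 128-byte line-size baseline, and the thresholds are assumptions for this sketch, not parameters reported in the paper, where this logic is implemented in hardware.

    // Illustrative sketch of a DualOpt-style decision engine (not the paper's design).
    // Thresholds and statistics are hypothetical; the paper implements this in hardware.
    #include <cstdint>
    #include <cstdio>

    enum class Mode { FineGrainCoalesce, RemoteOnlyCache };

    struct RemoteAccessStats {
        uint64_t remote_accesses;      // remote memory requests observed this epoch
        uint64_t remote_cache_hits;    // hits a small sampling cache would have captured
        uint64_t bytes_used_per_line;  // average bytes actually consumed per 128B line
    };

    // Low spatio-temporal locality -> cache-insensitive: transfer only the requested
    // words and coalesce them. Otherwise cache-friendly: route remote data through
    // a remote-only cache.
    Mode classify(const RemoteAccessStats& s) {
        const double hit_rate = s.remote_accesses
            ? static_cast<double>(s.remote_cache_hits) / s.remote_accesses : 0.0;
        const double line_usage = s.bytes_used_per_line / 128.0;

        const double kHitRateThreshold   = 0.30;  // assumed temporal-locality cutoff
        const double kLineUsageThreshold = 0.50;  // assumed spatial-locality cutoff

        if (hit_rate < kHitRateThreshold && line_usage < kLineUsageThreshold)
            return Mode::FineGrainCoalesce;  // cache-insensitive profile
        return Mode::RemoteOnlyCache;        // cache-friendly profile
    }

    int main() {
        RemoteAccessStats sparse{100000, 12000, 24};   // e.g., irregular, low-reuse accesses
        RemoteAccessStats dense {100000, 70000, 112};  // e.g., reuse-heavy, dense accesses

        std::printf("sparse -> %s\n", classify(sparse) == Mode::FineGrainCoalesce
                    ? "fine-grain coalesced transfers" : "remote-only cache");
        std::printf("dense  -> %s\n", classify(dense) == Mode::FineGrainCoalesce
                    ? "fine-grain coalesced transfers" : "remote-only cache");
        return 0;
    }

The sketch only mirrors the decision logic described in the abstract; in the actual design, the locality statistics would be gathered and acted upon by dedicated hardware.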
Pages: 304 - 316
Number of pages: 13
Related Papers
50 records in total
  • [1] Locality-aware Thread Block Design in Single and Multi-GPU Graph Processing
    Fan, Quan
    Chen, Zizhong
    2021 IEEE INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE AND STORAGE (NAS), 2021, : 148 - 151
  • [2] Locality-Aware GPU Register File
    Jeon, Hyeran
    Esfeden, Hodjat Asghari
    Abu-Ghazaleh, Nael B.
    Wong, Daniel
    Elango, Sindhuja
    IEEE COMPUTER ARCHITECTURE LETTERS, 2019, 18 (02) : 153 - 156
  • [3] Topology-aware Optimizations for Multi-GPU Ptychographic Image Reconstruction
    Yu, Xiaodong
    Bicer, Tekin
    Kettimuthu, Rajkumar
    Foster, Ian T.
    PROCEEDINGS OF THE 2021 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ICS 2021, 2021, : 354 - 366
  • [4] LARA: Locality-aware resource allocation to improve GPU memory-access time
    BiTalebi, Hossein
    Safaei, Farshad
    THE JOURNAL OF SUPERCOMPUTING, 2021, 77 : 14438 - 14460
  • [5] LARA: Locality-aware resource allocation to improve GPU memory-access time
    BiTalebi, Hossein
    Safaei, Farshad
    JOURNAL OF SUPERCOMPUTING, 2021, 77 (12): 14438 - 14460
  • [6] Locality-Aware Memory Association for Multi-Target Worksharing in OpenMP
    Scogland, Thomas R. W.
    Feng, Wu-Chun
    PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT'14), 2014, : 515 - 516
  • [7] Benchmarking multi-GPU applications on modern multi-GPU integrated systems
    Bernaschi, Massimo
    Agostini, Elena
    Rossetti, Davide
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (14)
  • [8] Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory
    Choi, Sangjin
    Kim, Taeksoo
    Jeong, Jinwoo
    Ausavarungnirun, Rachata
    Jeon, Myeongjae
    Kwon, Youngjin
    Ahn, Jeongseob
    PROCEEDINGS OF THE 2022 USENIX ANNUAL TECHNICAL CONFERENCE, 2022, : 625 - 638
  • [9] Locality-aware Partitioning in Parallel Database Systems
    Zamanian, Erfan
    Binnig, Carsten
    Salama, Abdallah
    SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015, : 17 - 30
  • [10] Modelling Multi-GPU Systems
    Spampinato, Daniele G.
    Elster, Anne C.
    Natvig, Thorvald
    PARALLEL COMPUTING: FROM MULTICORES AND GPU'S TO PETASCALE, 2010, 19 : 562 - 569