FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 3
Authors
Muthukrishnan, Harini [1 ,2 ]
Lustig, Daniel [1 ]
Villa, Oreste [1 ]
Wenisch, Thomas [2 ]
Nellans, David [1 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
Keywords
MEMORY; MANAGEMENT; PLACEMENT
DOI
10.1109/HPCA56546.2023.10070949
CLC number
TP3 [Computing Technology, Computer Technology]
Subject classification code
0812
Abstract
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small-transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect to show FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
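The efficiency gain the abstract describes comes from amortizing per-message link-protocol overhead across many small stores. The sketch below is an illustrative software model of that idea, not NVIDIA's hardware implementation; the header size (24B) and maximum coalesced payload (256B) are assumed values chosen only to make the arithmetic concrete.

```python
# Illustrative model of FinePack-style write coalescing: many small
# peer-to-peer stores are packed into fewer, larger I/O messages so
# the fixed per-message header is amortized. Header size and maximum
# payload are assumptions, not values from the paper.

HEADER_BYTES = 24   # assumed per-message link/protocol overhead
MAX_PAYLOAD = 256   # assumed maximum coalesced message payload

def wire_bytes_uncoalesced(writes):
    """Baseline: every small store travels as its own message."""
    return sum(HEADER_BYTES + size for _, size in writes)

def coalesce(writes):
    """Greedily pack stores into messages of at most MAX_PAYLOAD bytes.

    Merging plain stores between synchronization points is legal
    under the GPU's weak memory model, which is what the paper's
    mechanism exploits.
    """
    messages, current, used = [], [], 0
    for addr, size in writes:
        if used + size > MAX_PAYLOAD and current:
            messages.append(current)
            current, used = [], 0
        current.append((addr, size))
        used += size
    if current:
        messages.append(current)
    return messages

def wire_bytes_coalesced(writes):
    msgs = coalesce(writes)
    return sum(HEADER_BYTES + sum(s for _, s in m) for m in msgs)

# 64 scattered 8-byte stores, typical of irregular graph updates.
stores = [(i * 8, 8) for i in range(64)]
print(wire_bytes_uncoalesced(stores))  # 64 * (24 + 8) = 2048
print(wire_bytes_coalesced(stores))    # 2 * (24 + 256) = 560
```

With these assumed parameters, coalescing cuts bytes on the wire by roughly 3.7x, which is the same order of improvement (3x) the paper reports for small peer-to-peer stores.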
Pages: 516-529
Page count: 14
Related Papers
50 records in total
  • [1] Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers
    Muthukrishnan, Harini
    Nellans, David
    Lustig, Daniel
    Fessler, Jeffrey A.
    Wenisch, Thomas F.
    2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021), 2021, : 139 - 152
  • [2] Exploring Fine-Grained Task-based Execution on Multi-GPU Systems
    Chen, Long
    Villa, Oreste
    Gao, Guang R.
    2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011, : 386 - 394
  • [3] REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems
    Ko, Gun
    Lee, Jiwon
    Kal, Hongju
    Lee, Hyunwuk
    Ro, Won Woo
JOURNAL OF SYSTEMS ARCHITECTURE, 2025, 160
  • [4] GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement
    Wang, Yueqi
    Li, Bingyao
    Jaleel, Aamer
    Yang, Jun
    Tang, Xulong
    2024 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA 2024, 2024, : 1080 - 1094
  • [5] TurboDL: Improving the CNN Training on GPU With Fine-Grained Multi-Streaming Scheduling
    Jin, Hai
    Wu, Wenchao
    Shi, Xuanhua
    He, Ligang
    Zhou, Bing Bing
    IEEE TRANSACTIONS ON COMPUTERS, 2021, 70 (04) : 552 - 565
  • [6] Benchmarking multi-GPU applications on modern multi-GPU integrated systems
    Bernaschi, Massimo
    Agostini, Elena
    Rossetti, Davide
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (14):
  • [7] Modelling Multi-GPU Systems
    Spampinato, Daniele G.
    Elster, Anne C.
    Natvig, Thorvald
    PARALLEL COMPUTING: FROM MULTICORES AND GPU'S TO PETASCALE, 2010, 19 : 562 - 569
  • [8] Consumer Level Multi-GPU Systems Utilization, Efficiency, and Optimization
    Ross, John Brandon
    2013 PROCEEDINGS OF IEEE SOUTHEASTCON, 2013,
  • [9] A Multi-GPU PCISPH Implementation with Efficient Memory Transfers
    Verma, Kevin
    Peng, Chong
    Szewc, Kamil
    Wille, Robert
    2018 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2018,
  • [10] A Fine-grained Performance Model for GPU Architectures
    Bombieri, Nicola
    Busato, Federico
    Fummi, Franco
    PROCEEDINGS OF THE 2016 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), 2016, : 1267 - 1272