FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 3
Authors
Muthukrishnan, Harini [1, 2]
Lustig, Daniel [1]
Villa, Oreste [1]
Wenisch, Thomas [2]
Nellans, David [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
Keywords
MEMORY; MANAGEMENT; PLACEMENT
DOI
10.1109/HPCA56546.2023.10070949
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect to show FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
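The abstract's point that small stores waste link bandwidth on protocol overhead can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the paper: the per-packet overhead (`PACKET_OVERHEAD_BYTES`) and the per-store metadata cost inside a coalesced message (`PER_STORE_META_BYTES`) are hypothetical values chosen only to show the shape of the trade-off, and the function names are illustrative.

```python
# Illustrative model only: the byte counts below are assumptions,
# not figures reported for FinePack or for any specific interconnect.
PACKET_OVERHEAD_BYTES = 24   # hypothetical header/framing cost per link packet
PER_STORE_META_BYTES = 2     # hypothetical compressed address info per coalesced store


def link_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Fraction of bytes on the link that carry useful payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


def coalesced_efficiency(n_stores: int, store_bytes: int,
                         overhead_bytes: int, meta_bytes: int) -> float:
    """Efficiency when n_stores small writes share one packet.

    The packet pays the fixed overhead once, plus a small metadata
    cost per store so the receiver can recover each destination.
    """
    payload = n_stores * store_bytes
    total = payload + overhead_bytes + n_stores * meta_bytes
    return payload / total


if __name__ == "__main__":
    single = link_efficiency(8, PACKET_OVERHEAD_BYTES)
    packed = coalesced_efficiency(16, 8, PACKET_OVERHEAD_BYTES,
                                  PER_STORE_META_BYTES)
    print(f"one 8B store per packet:  {single:.2f}")
    print(f"16 x 8B stores coalesced: {packed:.2f}")
```

Under these assumed numbers, sending each 8-byte store in its own packet uses a quarter of the link for payload, while packing sixteen of them into one message amortizes the fixed overhead. The actual gains depend on the real protocol and on the compression scheme, which this record does not describe.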
Pages: 516-529
Number of pages: 14
Related papers
50 in total
  • [21] Suffix Array Construction on Multi-GPU Systems
    Bueren, Florian
    Juenger, Daniel
    Kobus, Robin
    Hundt, Christian
    Schmidt, Bertil
    HPDC'19: PROCEEDINGS OF THE 28TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, 2019, : 183 - 194
  • [22] Multi-GPU codes for spin systems simulations
    Bernaschi, M.
    Fatica, M.
    Parisi, G.
    Parisi, L.
    COMPUTER PHYSICS COMMUNICATIONS, 2012, 183 (07) : 1416 - 1421
  • [23] Triangle counting on GPU using fine-grained task distribution
    Hu, Lin
    Guan, Naiqing
    Zou, Lei
    2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDEW 2019), 2019, : 225 - 232
  • [24] Scalable Betweenness Centrality on Multi-GPU systems
    Bernaschi, Massimo
    Carbone, Giancarlo
    Vella, Flavio
    PROCEEDINGS OF THE ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS (CF'16), 2016, : 29 - 36
  • [25] Accelerating MapReduce framework on multi-GPU systems
    Hai Jiang
    Yi Chen
    Zhi Qiao
    Kuan-Ching Li
    WonWoo Ro
    Jean-Luc Gaudiot
    Cluster Computing, 2014, 17 : 293 - 301
  • [26] An Empirical Evaluation of Allgatherv on Multi-GPU Systems
    Rolinger, Thomas B.
    Simon, Tyler A.
    Krieger, Christopher D.
    2018 18TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2018, : 123 - 132
  • [27] Accelerating MapReduce framework on multi-GPU systems
    Jiang, Hai
    Chen, Yi
    Qiao, Zhi
    Li, Kuan-Ching
    Ro, WonWoo
    Gaudiot, Jean-Luc
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2014, 17 (02): : 293 - 301
  • [28] Hierarchical Bucket Queuing for Fine-Grained Priority Scheduling on the GPU
    Kerbl, Bernhard
    Kenzel, Michael
    Schmalstieg, Dieter
    Seidel, Hans-Peter
    Steinberger, Markus
    COMPUTER GRAPHICS FORUM, 2017, 36 (08) : 232 - 246
  • [29] Fine-grained GPU parallelization of Pairwise Local Sequence Alignment
    Jain, Chirag
    Kumar, Subodh
    2014 21ST INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2014,
  • [30] Efficient Sharing and Fine-Grained Scheduling of Virtualized GPU Resources
    Zhao, Xiaohui
    Yao, Jianguo
    Gao, Ping
    Guan, Haibing
    2018 IEEE 38TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS), 2018, : 742 - 752