FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 3

Authors:

Muthukrishnan, Harini [1,2]

Lustig, Daniel [1]

Villa, Oreste [1]

Wenisch, Thomas [2]

Nellans, David [1]

Affiliations:

[1] NVIDIA, Santa Clara, CA 95051 USA

[2] Univ Michigan, Ann Arbor, MI 48109 USA

Source:

2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA | 2023

Keywords:

MEMORY; MANAGEMENT; PLACEMENT;

DOI:

10.1109/HPCA56546.2023.10070949

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Subject classification code:

0812;

Abstract:

Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small-transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect and show that FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
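
To illustrate why coalescing small stores can reduce link-level protocol overhead, the following is a minimal back-of-the-envelope sketch, not the FinePack packet format. The per-packet overhead of 24 bytes and the 2 bytes of per-store metadata are assumptions chosen only for illustration, and the function names `link_efficiency` and `coalesced_efficiency` are hypothetical.

```python
def link_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Fraction of bytes on the wire that are useful payload for one packet."""
    return payload_bytes / (payload_bytes + overhead_bytes)


def coalesced_efficiency(store_bytes: int, num_stores: int,
                         overhead_bytes: int, per_store_meta_bytes: int) -> float:
    """Efficiency when num_stores small stores share a single packet.

    Each store is assumed to keep a few bytes of metadata (for example an
    address offset) so the receiver can scatter it to the right location.
    """
    payload = store_bytes * num_stores
    wire = payload + num_stores * per_store_meta_bytes + overhead_bytes
    return payload / wire


# Assumed values, for illustration only (not taken from the paper).
ASSUMED_PACKET_OVERHEAD = 24   # header, framing, and CRC bytes per packet
ASSUMED_STORE_METADATA = 2     # bytes of per-store metadata after packing

if __name__ == "__main__":
    separate = link_efficiency(8, ASSUMED_PACKET_OVERHEAD)
    packed = coalesced_efficiency(8, 16, ASSUMED_PACKET_OVERHEAD,
                                  ASSUMED_STORE_METADATA)
    print(f"one 8B store per packet: {separate:.2f}")
    print(f"16 x 8B stores packed:   {packed:.2f}")
```

Under these assumed numbers, a single 8-byte store uses only a quarter of the bytes it puts on the wire, while sixteen packed stores amortize the fixed overhead across a larger payload. The actual gain depends on the real protocol overhead and on the compression scheme, neither of which is specified in the abstract.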


Pages: 516-529

Page count: 14

