FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 3
Authors
Muthukrishnan, Harini [1 ,2 ]
Lustig, Daniel [1 ]
Villa, Oreste [1 ]
Wenisch, Thomas [2 ]
Nellans, David [1 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
Keywords
MEMORY; MANAGEMENT; PLACEMENT
DOI
10.1109/HPCA56546.2023.10070949
CLC number
TP3 [Computing Technology, Computer Technology]
Subject classification code
0812
Abstract
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small-transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect to show FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
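The efficiency gain the abstract describes comes from amortizing per-message link-protocol overhead across many small stores. The sketch below is an illustrative software model of that idea, not NVIDIA's hardware implementation; the header size (24B) and maximum coalesced payload (256B) are assumed values chosen only to make the arithmetic concrete.

```python
# Illustrative model of FinePack-style write coalescing: many small
# peer-to-peer stores are packed into fewer, larger I/O messages so
# the fixed per-message header is amortized. Header size and maximum
# payload are assumptions, not values from the paper.

HEADER_BYTES = 24   # assumed per-message link/protocol overhead
MAX_PAYLOAD = 256   # assumed maximum coalesced message payload

def wire_bytes_uncoalesced(writes):
    """Baseline: every small store travels as its own message."""
    return sum(HEADER_BYTES + size for _, size in writes)

def coalesce(writes):
    """Greedily pack stores into messages of at most MAX_PAYLOAD bytes.

    Merging plain stores between synchronization points is legal
    under the GPU's weak memory model, which is what the paper's
    mechanism exploits.
    """
    messages, current, used = [], [], 0
    for addr, size in writes:
        if used + size > MAX_PAYLOAD and current:
            messages.append(current)
            current, used = [], 0
        current.append((addr, size))
        used += size
    if current:
        messages.append(current)
    return messages

def wire_bytes_coalesced(writes):
    msgs = coalesce(writes)
    return sum(HEADER_BYTES + sum(s for _, s in m) for m in msgs)

# 64 scattered 8-byte stores, typical of irregular graph updates.
stores = [(i * 8, 8) for i in range(64)]
print(wire_bytes_uncoalesced(stores))  # 64 * (24 + 8) = 2048
print(wire_bytes_coalesced(stores))    # 2 * (24 + 256) = 560
```

With these assumed parameters, coalescing cuts bytes on the wire by roughly 3.7x, which is the same order of improvement (3x) the paper reports for small peer-to-peer stores.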
Pages: 516-529
Page count: 14
Related Papers
50 records in total
  • [1] Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers
    Muthukrishnan, Harini
    Nellans, David
    Lustig, Daniel
    Fessler, Jeffrey A.
    Wenisch, Thomas F.
    2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021), 2021, : 139 - 152
  • [2] Exploring Fine-Grained Task-based Execution on Multi-GPU Systems
    Chen, Long
    Villa, Oreste
    Gao, Guang R.
    2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011, : 386 - 394
  • [3] REC: Enhancing fine-grained cache coherence protocol in multi-GPU systems
    Ko, Gun
    Lee, Jiwon
    Kal, Hongju
    Lee, Hyunwuk
    Ro, Won Woo
JOURNAL OF SYSTEMS ARCHITECTURE, 2025, 160
  • [4] GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement
    Wang, Yueqi
    Li, Bingyao
    Jaleel, Aamer
    Yang, Jun
    Tang, Xulong
    2024 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA 2024, 2024, : 1080 - 1094
  • [5] TurboDL: Improving the CNN Training on GPU With Fine-Grained Multi-Streaming Scheduling
    Jin, Hai
    Wu, Wenchao
    Shi, Xuanhua
    He, Ligang
    Zhou, Bing Bing
    IEEE TRANSACTIONS ON COMPUTERS, 2021, 70 (04) : 552 - 565
  • [6] Benchmarking multi-GPU applications on modern multi-GPU integrated systems
    Bernaschi, Massimo
    Agostini, Elena
    Rossetti, Davide
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (14):
  • [7] Modelling Multi-GPU Systems
    Spampinato, Daniele G.
    Elster, Anne C.
    Natvig, Thorvald
    PARALLEL COMPUTING: FROM MULTICORES AND GPU'S TO PETASCALE, 2010, 19 : 562 - 569
  • [8] Consumer Level Multi-GPU Systems Utilization, Efficiency, and Optimization
    Ross, John Brandon
    2013 PROCEEDINGS OF IEEE SOUTHEASTCON, 2013,
  • [9] A Multi-GPU PCISPH Implementation with Efficient Memory Transfers
    Verma, Kevin
    Peng, Chong
    Szewc, Kamil
    Wille, Robert
    2018 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2018,
  • [10] A Fine-grained Performance Model for GPU Architectures
    Bombieri, Nicola
    Busato, Federico
    Fummi, Franco
    PROCEEDINGS OF THE 2016 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), 2016, : 1267 - 1272