Multi-Node Acceleration for Large-Scale GCNs

Cited by: 4

Authors:

Sun, Gongjian [1,2]

Yan, Mingyu [1,2]

Wang, Duo [1,2]

Li, Han [1,2]

Li, Wenming [1,2]

Ye, Xiaochun [1,2]

Fan, Dongrui [1,2]

Xie, Yuan [3]

Affiliations:

[1] Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100045, Peoples R China

[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China

[3] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA

Source:

IEEE TRANSACTIONS ON COMPUTERS | 2022 / Vol. 71 / No. 12

Funding:

National Natural Science Foundation of China;

Keywords:

Deep learning; graph neural network; hardware accelerator; multi-node system; communication optimization; NETWORK;

DOI:

10.1109/TC.2022.3207127

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Limited by memory capacity and computation power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, due to the explosive size of graphs nowadays. Thus, large-scale GCNs call for a multi-node acceleration system (MultiAccSys), like the tensor processing unit (TPU) Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication pattern and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) irregular coarse-grained communication patterns exist in the execution of GCNs in MultiAccSys, which introduce a massive amount of redundant network transmissions and off-chip memory accesses; (2) the acceleration of GCNs in MultiAccSys is mainly bounded by network bandwidth but tolerates network latency. Guided by the above observations, we then propose MultiGCN, an efficient MultiAccSys for large-scale GCNs that trades network latency for network bandwidth. Specifically, by leveraging the network latency tolerance, we first propose a topology-aware multicast mechanism with a one put per multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism which cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves 4~12x speedup using only 28%~68% energy, while reducing 32% transmissions and 73% off-chip memory accesses on average. Besides, MultiGCN not only achieves 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs as opposed to single-node GCN accelerators.

Pages: 3140-3152

Number of pages: 13
