Multi-Node Acceleration for Large-Scale GCNs

Cited by: 4

Authors:

Sun, Gongjian [1,2]

Yan, Mingyu [1,2]

Wang, Duo [1,2]

Li, Han [1,2]

Li, Wenming [1,2]

Ye, Xiaochun [1,2]

Fan, Dongrui [1,2]

Xie, Yuan [3]

Affiliations:

[1] Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100045, Peoples R China

[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China

[3] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA

Source:

IEEE TRANSACTIONS ON COMPUTERS | 2022, Vol. 71, No. 12

Funding:

National Natural Science Foundation of China;

Keywords:

Deep learning; graph neural network; hardware accelerator; multi-node system; communication optimization; NETWORK;

DOI:

10.1109/TC.2022.3207127

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Limited by memory capacity and computation power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, due to the explosive size of graphs nowadays. Thus, large-scale GCNs call for a multi-node acceleration system (MultiAccSys), like the tensor processing unit (TPU) Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication pattern and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) irregular coarse-grained communication patterns exist in the execution of GCNs in MultiAccSys, which introduce a massive amount of redundant network transmissions and off-chip memory accesses; (2) the acceleration of GCNs in MultiAccSys is mainly bounded by network bandwidth but tolerates network latency. Guided by the above observations, we then propose MultiGCN, an efficient MultiAccSys for large-scale GCNs that trades network latency for network bandwidth. Specifically, by leveraging the network latency tolerance, we first propose a topology-aware multicast mechanism with a one put per multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism which cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves 4~12x speedup using only 28%~68% energy, while reducing 32% transmissions and 73% off-chip memory accesses on average. Besides, MultiGCN not only achieves 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs as opposed to single-node GCN accelerators.


Pages: 3140-3152

Page count: 13
