Multi-Node Acceleration for Large-Scale GCNs

Cited by: 4
Authors
Sun, Gongjian [1, 2]
Yan, Mingyu [1, 2]
Wang, Duo [1, 2]
Li, Han [1, 2]
Li, Wenming [1, 2]
Ye, Xiaochun [1, 2]
Fan, Dongrui [1, 2]
Xie, Yuan [3]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100045, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China
[3] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; graph neural network; hardware accelerator; multi-node system; communication optimization; NETWORK;
DOI
10.1109/TC.2022.3207127
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812
Abstract
Limited by memory capacity and computation power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, due to the explosive size of graphs nowadays. Thus, large-scale GCNs call for a multi-node acceleration system (MultiAccSys), like the tensor processing unit (TPU) Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication pattern and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) irregular coarse-grained communication patterns exist in the execution of GCNs in MultiAccSys, which introduce a massive amount of redundant network transmissions and off-chip memory accesses; (2) the acceleration of GCNs in MultiAccSys is mainly bounded by network bandwidth but tolerates network latency. Guided by these observations, we then propose MultiGCN, an efficient MultiAccSys for large-scale GCNs that trades network latency for network bandwidth. Specifically, by leveraging the network latency tolerance, we first propose a topology-aware multicast mechanism with a one put per multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only 28%~68% of the energy, while reducing transmissions by 32% and off-chip memory accesses by 73% on average. In addition, MultiGCN not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, unlike single-node GCN accelerators.
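To make the redundancy argument in the abstract concrete, the following is a minimal counting sketch, not the MultiGCN mechanism itself. It assumes a hypothetical modulo partition (vertex v lives on node v % num_nodes) and a hypothetical function name, count_transmissions, and compares how many feature-vector sends a toy graph needs under three simplified models: one send per cross-node edge, one send per (source vertex, remote node) pair, and one put per source vertex that has any remote destination. Network topology, latency, and the round-based execution are not modeled.

```python
from collections import defaultdict


def count_transmissions(edges, num_nodes):
    """Count feature sends under three toy message-passing models.

    Assumption (illustrative only): vertex v is stored on node v % num_nodes.
    Returns (per_edge, per_remote_node, per_multicast).
    """
    def owner(v):
        return v % num_nodes

    per_edge = 0                      # one send for every cross-node edge
    remote_nodes = defaultdict(set)   # source vertex -> remote nodes that need it

    for src, dst in edges:
        if owner(src) != owner(dst):
            per_edge += 1
            remote_nodes[src].add(owner(dst))

    # One send per distinct (source vertex, remote node) pair.
    per_remote_node = sum(len(nodes) for nodes in remote_nodes.values())
    # One put per source vertex; the network is assumed to fan it out.
    per_multicast = len(remote_nodes)
    return per_edge, per_remote_node, per_multicast


# Toy directed edge list (source, destination); values are made up.
edges = [(0, 1), (0, 3), (0, 5), (0, 2), (1, 4), (1, 2)]
print(count_transmissions(edges, num_nodes=3))
```

In this toy model the gap between the first and last counts grows with the number of neighbors a vertex has on other nodes, which is the kind of redundancy the abstract attributes to coarse-grained communication.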
Pages: 3140-3152
Number of pages: 13