Multi-Node Acceleration for Large-Scale GCNs

Cited by: 4
Authors
Sun, Gongjian [1 ,2 ]
Yan, Mingyu [1 ,2 ]
Wang, Duo [1 ,2 ]
Li, Han [1 ,2 ]
Li, Wenming [1 ,2 ]
Ye, Xiaochun [1 ,2 ]
Fan, Dongrui [1 ,2 ]
Xie, Yuan [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100045, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China
[3] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; graph neural network; hardware accelerator; multi-node system; communication optimization; NETWORK;
DOI
10.1109/TC.2022.3207127
CLC number
TP3 [computing technology; computer technology];
Discipline code
0812 ;
Abstract
Limited by memory capacity and computation power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, due to the explosive size of today's graphs. Thus, large-scale GCNs call for a multi-node acceleration system (MultiAccSys), analogous to the tensor processing unit (TPU) Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication pattern and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) irregular coarse-grained communication patterns exist in the execution of GCNs in MultiAccSys, which introduces a massive amount of redundant network transmissions and off-chip memory accesses; (2) the acceleration of GCNs in MultiAccSys is mainly bounded by network bandwidth but tolerates network latency. Guided by these observations, we then propose MultiGCN, an efficient MultiAccSys for large-scale GCNs that trades network latency for network bandwidth. Specifically, by leveraging the network latency tolerance, we first propose a topology-aware multicast mechanism with a one-put-per-multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism which cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only 28%~68% of the energy, while reducing network transmissions by 32% and off-chip memory accesses by 73% on average. Besides, MultiGCN not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, as opposed to single-node GCN accelerators.
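The intuition behind the topology-aware multicast described in the abstract can be illustrated with a small counting sketch. This is not the paper's implementation; it is a hypothetical model assuming a 2D-mesh interconnect with dimension-order (X-then-Y) routing, and all function names are illustrative. Unicast pays every link of every path; a multicast tree that shares overlapping route prefixes carries the payload over each link only once.

```python
# Hypothetical sketch: link traffic of naive unicast vs. a topology-aware
# multicast tree on a 2D-mesh multi-node system (assumed topology, not
# taken from the paper). Routing is dimension-order: X first, then Y.

def xy_route(src, dst):
    """Yield the links of a dimension-order (X-first) route on a 2D mesh."""
    x, y = src
    step = 1 if dst[0] >= x else -1
    while x != dst[0]:
        yield ((x, y), (x + step, y))
        x += step
    step = 1 if dst[1] >= y else -1
    while y != dst[1]:
        yield ((x, y), (x, y + step))
        y += step

def unicast_traffic(src, dsts):
    """One point-to-point message per destination: shared links pay repeatedly."""
    return sum(1 for d in dsts for _ in xy_route(src, d))

def multicast_traffic(src, dsts):
    """One put per multicast: overlapping route prefixes are merged into a
    tree, so each tree link carries the payload exactly once."""
    return len({link for d in dsts for link in xy_route(src, d)})

src, dsts = (0, 0), [(2, 0), (2, 1), (2, 2)]
print(unicast_traffic(src, dsts))    # 9 link traversals
print(multicast_traffic(src, dsts))  # 4 link traversals
```

The gap between the two counts grows with the number of destinations sharing a route prefix, which is why the multicast mechanism reduces the network bandwidth pressure that the paper identifies as the main bottleneck.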
Pages: 3140 - 3152
Page count: 13