Accelerating distributed deep neural network training with pipelined MPI allreduce

Cited by: 6
Authors
Castelló, Adrián [1]
Quintana-Ortí, Enrique S. [1]
Duato, José [1]
Affiliation
[1] Universitat Politècnica de València, Valencia, Spain
Keywords
Message Passing Interface (MPI); Collective communication primitives; Allreduce; Deep learning; Distributed training; Collective communication
DOI
10.1007/s10586-021-03370-9
Chinese Library Classification
TP [Automation technology; computer technology]
Subject classification code
0812
Abstract
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool for training deep neural networks on clusters of computers. HVD, in turn, relies on a blocking Allreduce primitive to share information among processes, combined with a communication thread that overlaps communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
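To make points (2) and (3) of the abstract concrete, the sketch below first wraps the non-blocking MPI_Iallreduce with an immediate MPI_Wait, which preserves the blocking semantics of MPI_Allreduce while possibly selecting a different algorithm inside the MPI library, and then shows a pipelined variant that partitions the buffer into segments and issues one MPI_Iallreduce per segment so that successive chunks can progress concurrently. This is a minimal C/MPI illustration under assumed parameters (float data, MPI_SUM reduction, an illustrative buffer size and pipeline depth); it is not the implementation evaluated in the paper.

#include <mpi.h>
#include <stdlib.h>

#define COUNT    (1 << 22)  /* total floats to reduce; illustrative size      */
#define SEGMENTS 8          /* pipeline depth; illustrative, tune per network */

/* Variant (2): non-blocking Allreduce completed at once via MPI_Wait,
 * semantically equivalent to the blocking MPI_Allreduce. */
static void allreduce_nonblocking(const float *sendbuf, float *recvbuf, int count)
{
    MPI_Request req;
    MPI_Iallreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

/* Variant (3): pipelined Allreduce. The buffer is split into SEGMENTS
 * chunks and one non-blocking collective is started per chunk. All ranks
 * must issue the collectives in the same order; MPI_Waitall completes
 * the whole pipeline. */
static void allreduce_pipelined(const float *sendbuf, float *recvbuf, int count)
{
    MPI_Request reqs[SEGMENTS];
    const int base = count / SEGMENTS;
    for (int s = 0; s < SEGMENTS; s++) {
        const int offset = s * base;
        /* Last chunk absorbs any remainder when count % SEGMENTS != 0. */
        const int len = (s == SEGMENTS - 1) ? count - offset : base;
        MPI_Iallreduce(sendbuf + offset, recvbuf + offset, len, MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD, &reqs[s]);
    }
    MPI_Waitall(SEGMENTS, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    float *send = malloc(COUNT * sizeof *send);
    float *recv = malloc(COUNT * sizeof *recv);
    for (int i = 0; i < COUNT; i++) send[i] = 1.0f;  /* dummy gradient data */

    allreduce_nonblocking(send, recv, COUNT);  /* variant (2) */
    allreduce_pipelined(send, recv, COUNT);    /* variant (3) */

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}

The same segmentation idea carries over to other non-blocking collectives such as MPI_Ibcast or MPI_Ireduce_scatter, which is in the spirit of point (4) of the abstract.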
Pages: 3797-3813 (17 pages)
Related papers
50 records in total
  • [41] Accelerating CEST imaging using a model-based deep neural network with synthetic training data
    Xu, Jianping
    Zu, Tao
    Hsu, Yi-Cheng
    Wang, Xiaoli
    Chan, Kannie W. Y.
    Zhang, Yi
    [J]. MAGNETIC RESONANCE IN MEDICINE, 2023: 583 - 599
  • [42] Distributed Framework for Accelerating Training of Deep Learning Models through Prioritization
    Zhou, Tian
    Gao, Lixin
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E 2021, 2021, : 201 - 209
  • [43] Survey on Network of Distributed Deep Learning Training
    Zhu, Hongrui
    Yuan, Guojun
    Yao, Chengji
    Tan, Guangming
    Wang, Zhan
    Hu, Zhongzhe
    Zhang, Xiaoyang
    An, Xuejun
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2021, 58 (01): : 98 - 115
  • [44] Distributed Graph Neural Network Training: A Survey
    Shao, Yingxia
    Li, Hongzheng
    Gu, Xizhi
    Yin, Hongbo
    Li, Yawen
    Miao, Xupeng
    Zhang, Wentao
    Cui, Bin
    Chen, Lei
    [J]. ACM COMPUTING SURVEYS, 2024, 56 (08)
  • [45] NeuralGenesis: a software for distributed neural network training
    Tsoulos, Ioannis
    Tzallas, Alexandros T.
    Tsalikakis, Dimitrios G.
    Giannakeas, Nikolaos
    Tsipouras, Markos G.
    Androulidakis, Iosif
    Zaitseva, Elena
    [J]. 2016 24TH TELECOMMUNICATIONS FORUM (TELFOR), 2016, : 841 - 844
  • [46] Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters
    Truong Thao Nguyen
    Wahib, Mohamed
    Takano, Ryousei
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, 33 (12)
  • [47] GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training
    Cai, Tianle
    Luo, Shengjie
    Xu, Keyulu
    He, Di
    Liu, Tie-Yan
    Wang, Liwei
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021
  • [48] fuseGNN: Accelerating Graph Convolutional Neural Network Training on GPGPU
    Chen, Zhaodong
    Yan, Mingyu
    Zhu, Maohua
    Deng, Lei
    Li, Guoqi
    Li, Shuangchen
    Xie, Yuan
    [J]. 2020 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN (ICCAD), 2020
  • [49] Accelerating Neural Network Training with Processing-in-Memory GPU
    Fei, Xiang
    Han, Jianhui
    Huang, Jianqiang
    Zheng, Weimin
    Zhang, Youhui
    [J]. 2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022, : 414 - 421
  • [50] DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices
    Li, Dawei
    Wang, Xiaolong
    Kong, Deguang
    [J]. THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 2322 - 2330