Accelerating distributed deep neural network training with pipelined MPI allreduce

Cited: 6
Authors
Castelló, Adrián [1]
Quintana-Ortí, Enrique S. [1]
Duato, José [1]
Affiliation
[1] Universitat Politècnica de València, Valencia, Spain
Keywords
Message Passing Interface (MPI); Collective communication primitives; Allreduce; Deep learning; Distributed training; Collective communication
DOI
10.1007/s10586-021-03370-9
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool for training deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis that exposes (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
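As a concrete illustration of the idea described in the abstract (a minimal sketch, not the authors' implementation), the following C/MPI example contrasts a single blocking MPI_Allreduce over a gradient buffer with a pipelined variant that issues one non-blocking MPI_Iallreduce per fixed-size segment and waits for all of them at the end, preserving the blocking semantics HVD expects while allowing the segments to progress and overlap inside the MPI library. The segment size SEG and the helper names allreduce_blocking and allreduce_pipelined are illustrative assumptions, not values or identifiers taken from the paper.

```c
/*
 * Minimal sketch: blocking vs. pipelined (segmented, non-blocking) Allreduce.
 * SEG and the helper names are assumptions for illustration only.
 */
#include <mpi.h>
#include <stdlib.h>

#define SEG (1 << 20)   /* segment size in elements (assumed tunable) */

/* Baseline: one blocking call over the full gradient buffer. */
static void allreduce_blocking(float *grad, int count, MPI_Comm comm) {
    MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_FLOAT, MPI_SUM, comm);
}

/* Pipelined variant: split the buffer into segments, start one non-blocking
 * Allreduce per segment, and wait for all of them. The final MPI_Waitall
 * keeps the call blocking from the caller's point of view, while the
 * per-segment operations can overlap inside the MPI library. */
static void allreduce_pipelined(float *grad, int count, MPI_Comm comm) {
    int nseg = (count + SEG - 1) / SEG;
    MPI_Request *reqs = malloc(nseg * sizeof(MPI_Request));
    for (int s = 0; s < nseg; s++) {
        int offset = s * SEG;
        int len = (offset + SEG <= count) ? SEG : count - offset;
        MPI_Iallreduce(MPI_IN_PLACE, grad + offset, len,
                       MPI_FLOAT, MPI_SUM, comm, &reqs[s]);
    }
    MPI_Waitall(nseg, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int count = 1 << 24;                        /* e.g., 16M gradient entries */
    float *grad = calloc(count, sizeof(float));
    allreduce_blocking(grad, count, MPI_COMM_WORLD);    /* baseline */
    allreduce_pipelined(grad, count, MPI_COMM_WORLD);   /* pipelined variant */
    free(grad);
    MPI_Finalize();
    return 0;
}
```

The same segmentation pattern can, in principle, be applied to other MPI-3 non-blocking collectives such as MPI_Ibcast or MPI_Ireduce_scatter, which is the kind of extension point (4) of the abstract refers to.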
Pages: 3797-3813
Number of pages: 17