Accelerating distributed deep neural network training with pipelined MPI allreduce

Cited by: 6
Authors
Castello, Adrian [1 ]
Quintana-Orti, Enrique S. [1 ]
Duato, Jose [1 ]
Affiliations
[1] Univ Politecn Valencia, Valencia, Spain
Keywords
Message Passing Interface (MPI); Collective communication primitives; Allreduce; Deep learning; Distributed training; COLLECTIVE COMMUNICATION;
DOI
10.1007/s10586-021-03370-9
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Classification Code
0812;
Abstract
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool for training deep neural networks on clusters of computers. HVD in turn relies on a blocking Allreduce primitive to share information among processes, combined with a communication thread that overlaps communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
Pages: 3797-3813
Number of pages: 17
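As a concrete illustration of points (2) and (3) in the abstract, the sketch below splits a large reduction into segments, issues one non-blocking MPI_Iallreduce per segment, and then waits on all outstanding requests so the caller still observes blocking semantics. The segment size, buffer contents, and the helper name pipelined_allreduce are illustrative assumptions for this sketch, not the paper's actual implementation.

/* Minimal sketch of a segmented ("pipelined") Allreduce built from
 * non-blocking MPI_Iallreduce calls. Segment size and function name
 * are illustrative assumptions, not the authors' implementation. */
#include <mpi.h>
#include <stdlib.h>

/* Reduce 'count' floats in place across MPI_COMM_WORLD, issuing one
 * MPI_Iallreduce per segment so that successive segments can overlap. */
static void pipelined_allreduce(float *buf, int count, int seg)
{
    int nseg = (count + seg - 1) / seg;
    MPI_Request *reqs = malloc(nseg * sizeof(MPI_Request));

    for (int s = 0; s < nseg; ++s) {
        int off = s * seg;
        int len = (off + seg <= count) ? seg : count - off;
        MPI_Iallreduce(MPI_IN_PLACE, buf + off, len, MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD, &reqs[s]);
    }
    /* Waiting on all segments restores the blocking behaviour that the
     * caller (e.g., a fused gradient Allreduce) expects. */
    MPI_Waitall(nseg, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int n = 1 << 24;                        /* e.g., one fused gradient buffer */
    float *grad = malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) grad[i] = 1.0f;

    pipelined_allreduce(grad, n, 1 << 20);  /* 1 Mi-element segments (assumed) */

    free(grad);
    MPI_Finalize();
    return 0;
}

Note that the overlap only materializes if the MPI library makes asynchronous progress on the outstanding requests, and the segment size generally has to be tuned to the message size and network characteristics.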