Accelerating neural network training with distributed asynchronous and selective optimization (DASO)

Cited by: 0
Authors
Daniel Coquelin
Charlotte Debus
Markus Götz
Fabrice von der Lehr
James Kahn
Martin Siggel
Achim Streit
Affiliations
[1] Karlsruhe Institute of Technology
[2] German Aerospace Center
Source
Journal of Big Data, 2022, 9(1)
Keywords
Machine learning; Neural networks; Data parallel training; Multi-node; Multi-GPU; Stale gradients
DOI
Not available
Abstract
With increasing data and model complexity, the time required to train neural networks has become prohibitively long. To address the exponential rise in training time, users are turning to data parallel neural networks (DPNN) and large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations after each forward-backward pass. This synchronization is the central algorithmic bottleneck. We introduce the distributed asynchronous and selective optimization (DASO) method, which leverages multi-GPU compute node architectures to accelerate network training while maintaining accuracy. DASO uses a hierarchical and asynchronous communication scheme comprising node-local and global networks, and adjusts the global synchronization rate during the learning process. We show that DASO reduces training time by up to 34% on classical and state-of-the-art networks compared with current optimized data parallel training methods.
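The hierarchical scheme described in the abstract can be pictured as two nested communication groups: a fast node-local group over the GPUs of one compute node, which averages gradients after every forward-backward pass, and a slower global group spanning one representative GPU per node, which is consulted only every few batches. The following Python sketch illustrates this idea with torch.distributed. It is not the authors' implementation; the class name, the `gpus_per_node` and `global_sync_every` parameters, and the use of blocking collectives are illustrative assumptions. The published method additionally overlaps the inter-node exchange with computation (tolerating stale gradients) and adapts the global synchronization rate as training progresses; the sketch keeps everything blocking and fixed-rate for readability.

```python
# A minimal sketch of hierarchical gradient synchronization in the spirit of DASO,
# written against torch.distributed. This is not the authors' implementation; the
# class name, the `global_sync_every` parameter, and the blocking collectives are
# illustrative assumptions made for clarity.
import torch.distributed as dist


class HierarchicalGradientSync:
    """Average gradients inside each node every step, across nodes only occasionally."""

    def __init__(self, model, gpus_per_node, global_sync_every=4):
        # Assumes dist.init_process_group(...) has already been called (e.g. via torchrun)
        # and that all ranks start from identical model parameters.
        self.model = model
        self.gpus_per_node = gpus_per_node
        self.global_sync_every = global_sync_every
        self.step = 0

        world = dist.get_world_size()
        rank = dist.get_rank()
        self.num_nodes = world // gpus_per_node

        # Build one process group per node; new_group() is collective, so every rank
        # executes every call and keeps only the group it belongs to.
        self.local_group = None
        self.local_root = None
        for n in range(self.num_nodes):
            ranks = list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
            group = dist.new_group(ranks)
            if rank in ranks:
                self.local_group = group
                self.local_root = ranks[0]

        # Global group: one representative rank per node, communicating over the
        # slower inter-node network.
        root_ranks = list(range(0, world, gpus_per_node))
        self.global_group = dist.new_group(root_ranks)
        self.is_node_root = rank in root_ranks

    def sync_gradients(self):
        self.step += 1
        do_global = self.step % self.global_sync_every == 0
        for p in self.model.parameters():
            if p.grad is None:
                continue
            # 1) Node-local average after every forward-backward pass, over the fast
            #    intra-node interconnect (NVLink/PCIe).
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=self.local_group)
            p.grad /= self.gpus_per_node

            # 2) Every `global_sync_every` steps, the node roots average across nodes
            #    and re-broadcast the result to the other GPUs on their node.
            if do_global:
                if self.is_node_root:
                    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=self.global_group)
                    p.grad /= self.num_nodes
                dist.broadcast(p.grad, src=self.local_root, group=self.local_group)
```

In a training loop, `sync_gradients()` would be called between `loss.backward()` and `optimizer.step()`, in place of the single blocking all-reduce over all processes that standard data parallel training performs after every batch.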
Related papers (50 in total)
  • [1] Coquelin D, Debus C, Götz M, von der Lehr F, Kahn J, Siggel M, Streit A. Accelerating neural network training with distributed asynchronous and selective optimization (DASO). Journal of Big Data, 2022, 9(1).
  • [2] Castelló A, Quintana-Ortí E S, Duato J. Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing, 2021, 24(4): 3797-3813.
  • [3] Zhang S, Diao L, Wu C, Wang S, Lin W. Accelerating large-scale distributed neural network training with SPMD parallelism. Proceedings of the 13th Symposium on Cloud Computing (SoCC 2022), 2022: 403-418.
  • [4] Fu J, Huang Y, Xu J, Wu H. Optimization of distributed convolutional neural network for image labeling on asynchronous GPU model. International Journal of Innovative Computing, Information and Control, 2019, 15(3): 1145-1156.
  • [5] Chan W, Lane I. Distributed asynchronous optimization of convolutional neural networks. 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), 2014: 1073-1077.
  • [6] Tyagi S, Swany M. Accelerating distributed ML training via selective synchronization. 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), 2023: 56-57.
  • [7] Tyagi S, Swany M. Accelerating distributed ML training via selective synchronization. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 2023: 1-12.
  • [8] Xu J, Wang J, Qi Q, Sun H, Liao J. Accelerating training for distributed deep neural networks in MapReduce. Web Services - ICWS 2018, 2018, 10966: 181-195.
  • [9] Nokhwal S, Chilakalapudi P, Donekal P, Nokhwal S, Pahune S, Chaudhary A. Accelerating neural network training: a brief review. 2024 8th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (ISMSI 2024), 2024: 31-35.