Accelerating neural network training with distributed asynchronous and selective optimization (DASO)

Cited by: 0
Authors
Daniel Coquelin
Charlotte Debus
Markus Götz
Fabrice von der Lehr
James Kahn
Martin Siggel
Achim Streit
Affiliations
[1] Karlsruhe Institute of Technology
[2] German Aerospace Center
Source
Journal of Big Data, 2022, 9(1)
Keywords
Machine learning; Neural networks; Data parallel training; Multi-node; Multi-GPU; Stale gradients
DOI
Not available
Abstract
With increasing data and model complexity, the time required to train neural networks has become prohibitively long. To address the exponential rise in training time, users are turning to data parallel neural networks (DPNN) and large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations after each forward-backward pass. This synchronization is the central algorithmic bottleneck. We introduce the distributed asynchronous and selective optimization (DASO) method, which leverages multi-GPU compute node architectures to accelerate network training while maintaining accuracy. DASO uses a hierarchical and asynchronous communication scheme comprising node-local and global networks, and adjusts the global synchronization rate during the learning process. We show that DASO reduces training time by up to 34% on classical and state-of-the-art networks compared with current optimized data parallel training methods.
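The hierarchical scheme described in the abstract can be pictured as two nested communication groups: a fast node-local group over the GPUs of one compute node, which averages gradients after every forward-backward pass, and a slower global group spanning one representative GPU per node, which is consulted only every few batches. The following Python sketch illustrates this idea with torch.distributed. It is not the authors' implementation; the class name, the `gpus_per_node` and `global_sync_every` parameters, and the use of blocking collectives are illustrative assumptions. The published method additionally overlaps the inter-node exchange with computation (tolerating stale gradients) and adapts the global synchronization rate as training progresses; the sketch keeps everything blocking and fixed-rate for readability.

```python
# A minimal sketch of hierarchical gradient synchronization in the spirit of DASO,
# written against torch.distributed. This is not the authors' implementation; the
# class name, the `global_sync_every` parameter, and the blocking collectives are
# illustrative assumptions made for clarity.
import torch.distributed as dist


class HierarchicalGradientSync:
    """Average gradients inside each node every step, across nodes only occasionally."""

    def __init__(self, model, gpus_per_node, global_sync_every=4):
        # Assumes dist.init_process_group(...) has already been called (e.g. via torchrun)
        # and that all ranks start from identical model parameters.
        self.model = model
        self.gpus_per_node = gpus_per_node
        self.global_sync_every = global_sync_every
        self.step = 0

        world = dist.get_world_size()
        rank = dist.get_rank()
        self.num_nodes = world // gpus_per_node

        # Build one process group per node; new_group() is collective, so every rank
        # executes every call and keeps only the group it belongs to.
        self.local_group = None
        self.local_root = None
        for n in range(self.num_nodes):
            ranks = list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
            group = dist.new_group(ranks)
            if rank in ranks:
                self.local_group = group
                self.local_root = ranks[0]

        # Global group: one representative rank per node, communicating over the
        # slower inter-node network.
        root_ranks = list(range(0, world, gpus_per_node))
        self.global_group = dist.new_group(root_ranks)
        self.is_node_root = rank in root_ranks

    def sync_gradients(self):
        self.step += 1
        do_global = self.step % self.global_sync_every == 0
        for p in self.model.parameters():
            if p.grad is None:
                continue
            # 1) Node-local average after every forward-backward pass, over the fast
            #    intra-node interconnect (NVLink/PCIe).
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=self.local_group)
            p.grad /= self.gpus_per_node

            # 2) Every `global_sync_every` steps, the node roots average across nodes
            #    and re-broadcast the result to the other GPUs on their node.
            if do_global:
                if self.is_node_root:
                    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=self.global_group)
                    p.grad /= self.num_nodes
                dist.broadcast(p.grad, src=self.local_root, group=self.local_group)
```

In a training loop, `sync_gradients()` would be called between `loss.backward()` and `optimizer.step()`, in place of the single blocking all-reduce over all processes that standard data parallel training performs after every batch.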
Related papers (50 in total)
  • [1] Coquelin D, Debus C, Götz M, von der Lehr F, Kahn J, Siggel M, Streit A. Accelerating neural network training with distributed asynchronous and selective optimization (DASO). Journal of Big Data, 2022, 9(1).
  • [2] Castelló A, Quintana-Ortí E S, Duato J. Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing, 2021, 24(4): 3797-3813.
  • [3] Zhang S, Diao L, Wu C, Wang S, Lin W. Accelerating large-scale distributed neural network training with SPMD parallelism. Proceedings of the 13th Symposium on Cloud Computing (SoCC 2022), 2022: 403-418.
  • [4] Fu J, Huang Y, Xu J, Wu H. Optimization of distributed convolutional neural network for image labeling on asynchronous GPU model. International Journal of Innovative Computing, Information and Control, 2019, 15(3): 1145-1156.
  • [5] Chan W, Lane I. Distributed asynchronous optimization of convolutional neural networks. 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), 2014: 1073-1077.
  • [6] Tyagi S, Swany M. Accelerating distributed ML training via selective synchronization. 2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), 2023: 56-57.
  • [7] Tyagi S, Swany M. Accelerating distributed ML training via selective synchronization. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 2023: 1-12.
  • [8] Xu J, Wang J, Qi Q, Sun H, Liao J. Accelerating training for distributed deep neural networks in MapReduce. Web Services - ICWS 2018, 2018, 10966: 181-195.
  • [9] Nokhwal S, Chilakalapudi P, Donekal P, Nokhwal S, Pahune S, Chaudhary A. Accelerating neural network training: a brief review. 2024 8th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (ISMSI 2024), 2024: 31-35.