Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning

Cited: 0
Authors
Yoon, Daegun [1 ]
Oh, Sangyoon [2 ]
Affiliations
[1] ETRI, Daejeon, South Korea
[2] Ajou Univ, Suwon, South Korea
Keywords
distributed deep learning; gradient sparsification; scalability;
DOI
10.1109/CCGrid59990.2024.00043
CLC Number
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Communication overhead is a major obstacle to scaling distributed training systems. Gradient sparsification is a promising optimization approach that reduces the communication volume without significant loss of model fidelity. However, existing gradient sparsification methods scale poorly owing to the inefficient design of their algorithms, which raises communication overhead significantly. In particular, gradient build-up and inadequate sparsity control degrade sparsification performance considerably. Moreover, communication traffic increases drastically owing to the workload imbalance of gradient selection across workers. To address these challenges, we propose a novel gradient sparsification scheme called ExDyna. In ExDyna, the gradient tensor of the model comprises fine-grained blocks, and contiguous blocks are grouped into non-overlapping partitions. Each worker selects gradients only in its exclusively allocated partition, so gradient build-up never occurs. To balance the workload of gradient selection across workers, ExDyna adjusts the topology of the partitions by comparing the workloads of adjacent partitions. In addition, ExDyna supports online threshold scaling, which estimates an accurate gradient-selection threshold on the fly. Accordingly, ExDyna satisfies the user-required sparsity level throughout training regardless of the model or dataset. As a result, ExDyna enhances the scalability of distributed training systems by preserving a near-optimal gradient sparsification cost. In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance while achieving high accuracy.
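The following is a minimal illustrative sketch, not the authors' implementation, of the two mechanisms the abstract describes: disjoint partition-based gradient selection and online threshold scaling. All names (partition_blocks, select_in_partition, scale_threshold, block_size) and the damped threshold-correction rule are assumptions for exposition; the adaptive adjustment of the partition topology is omitted.

import numpy as np

def partition_blocks(num_blocks, num_workers):
    # Group contiguous gradient blocks into non-overlapping partitions,
    # one per worker, so no two workers ever select the same gradient
    # (i.e., the gradient build-up mentioned in the abstract cannot occur).
    bounds = np.linspace(0, num_blocks, num_workers + 1, dtype=int)
    return [(int(bounds[i]), int(bounds[i + 1])) for i in range(num_workers)]

def select_in_partition(grad, block_size, partition, threshold):
    # Each worker scans only the blocks it exclusively owns and keeps the
    # gradient entries whose magnitude exceeds the current threshold.
    start_blk, end_blk = partition
    lo = start_blk * block_size
    hi = min(end_blk * block_size, grad.size)
    local = grad[lo:hi]
    idx = np.flatnonzero(np.abs(local) >= threshold) + lo
    return idx, grad[idx]

def scale_threshold(threshold, num_selected, num_expected):
    # Online threshold scaling (assumed damped correction): nudge the
    # threshold so the achieved sparsity tracks the user-required sparsity.
    if num_expected == 0 or num_selected == 0:
        return threshold
    return threshold * (num_selected / num_expected) ** 0.5

# Example: worker 0 of 4 selects from a 1M-element flattened gradient,
# targeting roughly 0.1% density (about 250 elements per worker).
grad = np.random.randn(1_000_000).astype(np.float32)
parts = partition_blocks(num_blocks=1024, num_workers=4)
threshold = 3.0
idx, vals = select_in_partition(grad, block_size=1024, partition=parts[0], threshold=threshold)
threshold = scale_threshold(threshold, num_selected=idx.size, num_expected=250)

Because the partitions are disjoint, concatenating the (idx, vals) pairs from all workers yields a sparse gradient with no duplicated indices, which is the property that keeps the communicated volume close to the target sparsity.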
Pages: 320-329
Page count: 10