Preserving Near-Optimal Gradient Sparsification Cost for Scalable Distributed Deep Learning

Cited by: 0
Authors
Yoon, Daegun [1 ]
Oh, Sangyoon [2 ]
Affiliations
[1] ETRI, Daejeon, South Korea
[2] Ajou Univ, Suwon, South Korea
Keywords
distributed deep learning; gradient sparsification; scalability;
DOI
10.1109/CCGrid59990.2024.00043
CLC number
TP39 [Applications of Computers];
Subject classification codes
081203 ; 0835 ;
Abstract
Communication overhead is a major obstacle to scaling distributed training systems. Gradient sparsification is a potential optimization approach that reduces communication volume without significant loss of model fidelity. However, existing gradient sparsification methods exhibit low scalability owing to the inefficient design of their algorithms, which significantly raises communication overhead. In particular, gradient build-up and inadequate sparsity control considerably degrade sparsification performance. Moreover, communication traffic increases drastically owing to the workload imbalance of gradient selection between workers. To address these challenges, we propose a novel gradient sparsification scheme called ExDyna. In ExDyna, the gradient tensor of the model comprises fine-grained blocks, and contiguous blocks are grouped into non-overlapping partitions. Each worker selects gradients only in its exclusively allocated partition, so gradient build-up never occurs. To balance the workload of gradient selection between workers, ExDyna adjusts the topology of partitions by comparing the workloads of adjacent partitions. In addition, ExDyna supports online threshold scaling, which estimates an accurate gradient-selection threshold on the fly. Accordingly, ExDyna can satisfy the user-required sparsity level throughout training regardless of model or dataset. ExDyna thus enhances the scalability of distributed training systems by preserving near-optimal gradient sparsification cost. In experiments, ExDyna outperformed state-of-the-art sparsifiers in terms of training speed and sparsification performance while achieving high accuracy.
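The abstract describes two mechanisms: each worker thresholds gradients only inside its exclusive partition (which rules out gradient build-up by construction), and the selection threshold is adjusted online to track a user-required sparsity level. The following is a minimal illustrative sketch of these two ideas in NumPy; the function names and the simple multiplicative threshold-update rule are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def select_in_partition(grad, lo, hi, threshold):
    """A worker selects gradient entries only inside its exclusive
    partition [lo, hi), so no index can be picked by two workers
    (i.e., no gradient build-up)."""
    idx = np.arange(lo, hi)
    mask = np.abs(grad[lo:hi]) >= threshold
    return idx[mask], grad[lo:hi][mask]

def scale_threshold(threshold, num_selected, partition_size, target_density):
    """Illustrative online threshold scaling (a hypothetical multiplicative
    rule, not the paper's estimator): nudge the threshold so the achieved
    density tracks the user-required density."""
    achieved = num_selected / max(partition_size, 1)
    if achieved > target_density:
        return threshold * 1.1   # selected too many entries -> raise threshold
    return threshold * 0.9       # selected too few entries -> lower threshold

# Toy run: one worker owns partition [0, 500) of a 1000-element gradient
# and targets a density of 0.1 (i.e., 90% sparsity).
rng = np.random.default_rng(0)
grad = rng.standard_normal(1000)
threshold, target = 1.0, 0.1
for _ in range(20):
    idx, vals = select_in_partition(grad, 0, 500, threshold)
    threshold = scale_threshold(threshold, len(idx), 500, target)
```

After a few iterations the threshold settles near the value that yields the target density within the worker's own partition; in a real system each worker would communicate only its selected (index, value) pairs.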
Pages: 320 - 329
Page count: 10
Related Papers
(50 total)
  • [21] Scalable Near-Optimal Recursive Structure From Motion
    Fakih, Adel
    Zelek, John
    2009 CANADIAN CONFERENCE ON COMPUTER AND ROBOT VISION, 2009, : 23 - 30
  • [22] A Distributed Near-Optimal LSH-based Framework for Privacy-Preserving Record Linkage
    Karapiperis, Dimitrios
    Verykios, Vassilios S.
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2014, 11 (02) : 745 - 763
  • [23] Near-Optimal Distributed Routing with Low Memory
    Elkin, Michael
    Neiman, Ofer
    PODC'18: PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON PRINCIPLES OF DISTRIBUTED COMPUTING, 2018, : 207 - 216
  • [24] Deterministic near-optimal distributed listing of cliques
    Censor-Hillel, Keren
    Leitersdorf, Dean
    Vulakh, David
    DISTRIBUTED COMPUTING, 2024, 37 (04) : 363 - 385
  • [25] Safe Learning for Near-Optimal Scheduling
    Busatto-Gaston, Damien
    Chakraborty, Debraj
    Guha, Shibashis
    Perez, Guillermo A.
    Raskin, Jean-Francois
    QUANTITATIVE EVALUATION OF SYSTEMS (QEST 2021), 2021, 12846 : 235 - 254
  • [26] Near-Optimal Collaborative Learning in Bandits
    Reda, Clemence
    Vakili, Sattar
    Kaufmann, Emilie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [27] BOUNDS FOR THE ADDITIONAL COST OF NEAR-OPTIMAL CONTROLS
    STEINBERG, AM
    FORTE, I
    JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS, 1980, 31 (03) : 385 - 395
  • [28] Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
    Wu, Tianfu
    Zhu, Song-Chun
    2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2013, : 753 - 760
  • [29] Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection
    Wu, Tianfu
    Zhu, Song-Chun
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2015, 37 (05) : 1013 - 1027
  • [30] MobiPack: Optimal hitless SONET defragmentation in near-optimal cost
    Acharya, S
    Gupta, B
    Risbood, P
    Srivastava, A
    IEEE INFOCOM 2004: THE CONFERENCE ON COMPUTER COMMUNICATIONS, VOLS 1-4, PROCEEDINGS, 2004, : 1819 - 1829