ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments

Cited: 0
Authors
Shen, Zhaoyan [1 ]
Tang, Qingxiang [1 ]
Zhou, Tianren [1 ]
Zhang, Yuhao [2 ]
Jia, Zhiping [1 ]
Yu, Dongxiao [1 ]
Zhang, Zhiyong [3 ]
Li, Bingzhe [4 ]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266237, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[3] Quan Cheng Lab, Jinan 250103, Peoples R China
[4] Univ Texas Dallas, Comp Sci Dept, Richardson, TX 75080 USA
Keywords
Distributed deep learning; parameter server; data parallelism; optimization
DOI
10.1109/TC.2023.3315847
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
As datasets and models grow, distributed deep learning has been proposed to accelerate training and improve the accuracy of DNN models. The parameter server framework is a popular architecture for data-parallel training and works well in homogeneous environments, where the computation and communication capabilities of workers can be aggregated effectively. In heterogeneous environments, however, worker resources vary widely, and stragglers can throttle the overall training speed. In this paper, we propose ASHL, an adaptive multi-stage distributed deep learning training framework for heterogeneous environments. First, a profiling scheme captures the capability of each worker so that training and communication tasks can be planned sensibly, laying the groundwork for the subsequent training. Second, a hybrid-mode training scheme combining coarse-grained and fine-grained stages balances model accuracy and training speed. The coarse-grained stage (named AHL) adopts an asynchronous communication strategy with infrequent communication; its goal is to bring the model quickly to a rough level of convergence. The fine-grained stage (named SHL) uses a semi-asynchronous communication strategy with a high communication frequency; its goal is to refine convergence quality. Finally, a compression-based communication scheme further improves the communication efficiency of training. Experimental results show that ASHL reduces overall training time by more than 35% when converging to the same level and achieves better generalization than state-of-the-art schemes such as ADSP.
Pages
30-43 (14 pages)
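
The abstract describes the staged AHL/SHL schedule only at a high level. As a concrete illustration, the following is a minimal single-process simulation of that schedule under our own assumptions: a toy quadratic objective, a ParameterServer class with an optional bounded-staleness check, a coarse-grained AHL-style phase that accumulates local steps and pushes asynchronously, and a loss threshold (switch_loss) that triggers the switch to the fine-grained SHL-style phase. None of these names, constants, or mechanisms come from the paper itself.

```python
# Minimal single-process sketch of the two-stage schedule described in the
# abstract. Everything here (toy objective, ParameterServer, switch_loss,
# the simulated staleness) is an illustrative assumption, not ASHL's code.
import random

TARGET = [3.0, -2.0, 1.0]                       # optimum of the toy objective

def grad(w):                                    # gradient of ||w - TARGET||^2
    return [2.0 * (wi - ti) for wi, ti in zip(w, TARGET)]

def loss(w):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))

class ParameterServer:
    """Toy parameter server with an optional bounded-staleness check."""
    def __init__(self, dim, lr=0.05):
        self.w, self.version, self.lr = [0.0] * dim, 0, lr

    def pull(self):
        return list(self.w), self.version

    def push(self, g, base_version, staleness_bound=None):
        # Semi-asynchronous mode: reject updates built on a too-stale model.
        # Fully asynchronous mode (staleness_bound=None): accept everything.
        if staleness_bound is not None and self.version - base_version > staleness_bound:
            return False
        self.w = [wi - self.lr * gi for wi, gi in zip(self.w, g)]
        self.version += 1
        return True

random.seed(0)
ps = ParameterServer(dim=3)
stalenesses = [0, 1, 4]            # heterogeneous workers: slower means staler
stage, switch_loss = "AHL", 1.0    # switch stages once the loss is "good enough"

for step in range(500):
    extra = random.choice(stalenesses)          # pick a (possibly slow) worker
    w_local, v = ps.pull()
    base = max(0, v - extra)       # pretend the pull happened `extra` pushes ago
    if stage == "AHL":
        # Coarse-grained stage: several local SGD steps, then one async push.
        w_start = list(w_local)
        for _ in range(5):
            w_local = [wi - ps.lr * gi for wi, gi in zip(w_local, grad(w_local))]
        summed = [(s - e) / ps.lr for s, e in zip(w_start, w_local)]
        ps.push(summed, base)                   # no staleness bound in AHL
        if loss(ps.w) < switch_loss:
            stage = "SHL"          # roughly converged: move to fine-grained stage
    else:
        # Fine-grained stage: push every step, but enforce bounded staleness.
        ps.push(grad(w_local), base, staleness_bound=2)

print(stage, round(loss(ps.w), 6))              # e.g. "SHL 0.0"
```

In this toy, the AHL phase communicates once per five local steps and accepts arbitrarily stale pushes, while the SHL phase pushes every step but discards updates older than the bound, mirroring the asynchronous-then-semi-asynchronous progression the abstract describes.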
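
The abstract's last component is a compression-based communication scheme, but it does not specify which compressor ASHL uses. The sketch below therefore substitutes a generic top-k sparsifier with error feedback, a common choice for reducing parameter-server traffic; CompressedWorker, topk_compress, and the 6-dimensional toy gradient are all illustrative assumptions, not the paper's method.

```python
# Hypothetical stand-in for the abstract's compression-based communication:
# top-k sparsification with an error-feedback residual. Illustrative only.
import heapq

def topk_compress(g, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    idx = heapq.nlargest(k, range(len(g)), key=lambda i: abs(g[i]))
    return [(i, g[i]) for i in idx]

def decompress(pairs, dim):
    dense = [0.0] * dim
    for i, v in pairs:
        dense[i] = v
    return dense

class CompressedWorker:
    """Sends only the top-k gradient entries; the rest accumulate locally."""
    def __init__(self, dim):
        self.residual = [0.0] * dim             # error-feedback buffer

    def encode(self, g, k):
        corrected = [gi + ri for gi, ri in zip(g, self.residual)]
        pairs = topk_compress(corrected, k)
        sent = {i for i, _ in pairs}
        # Unsent coordinates stay in the residual and are retried next round,
        # so compression delays small updates instead of dropping them.
        self.residual = [0.0 if i in sent else corrected[i]
                         for i in range(len(corrected))]
        return pairs

worker = CompressedWorker(dim=6)
g1 = [0.9, -0.1, 0.05, -1.2, 0.3, 0.02]
msg = worker.encode(g1, k=2)                    # only 2 of 6 values on the wire
print(msg)                                      # [(3, -1.2), (0, 0.9)]
print(decompress(msg, 6))                       # [0.9, 0.0, 0.0, -1.2, 0.0, 0.0]
```

The error-feedback residual is what keeps this kind of aggressive sparsification from systematically biasing the update direction over many rounds.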