ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments

Cited by: 0
Authors
Shen, Zhaoyan [1 ]
Tang, Qingxiang [1 ]
Zhou, Tianren [1 ]
Zhang, Yuhao [2 ]
Jia, Zhiping [1 ]
Yu, Dongxiao [1 ]
Zhang, Zhiyong [3 ]
Li, Bingzhe [4 ]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266237, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[3] Quan Cheng Lab, Jinan 250103, Peoples R China
[4] Univ Texas Dallas, Comp Sci Dept, Richardson, TX 75080 USA
Keywords
Distributed deep learning; parameter server; data parallelism; optimization
DOI
10.1109/TC.2023.3315847
Chinese Library Classification (CLC)
TP3 [computing technology; computer technology]
Discipline code
0812
Abstract
With the growth of dataset and model sizes, distributed deep learning has been proposed to accelerate training and improve the accuracy of DNN models. The parameter server framework is a popular collaborative architecture for data-parallel training; it works well in homogeneous environments by properly aggregating the computation and communication capabilities of different workers. In heterogeneous environments, however, worker resources vary significantly, and stragglers can severely limit the overall training speed. In this paper, we propose an adaptive multi-stage distributed deep learning training framework, named ASHL, for heterogeneous environments. First, a profiling scheme captures the capability of each worker, so that training and communication tasks can be planned reasonably on each worker, laying the foundation for formal training. Second, a hybrid-mode training scheme (i.e., coarse-grained and fine-grained training) balances model accuracy and training speed. The coarse-grained training stage (named AHL) adopts an asynchronous communication strategy with less frequent communication; its main goal is to make the model converge quickly to a certain level. The fine-grained training stage (named SHL) uses a semi-asynchronous communication strategy with a high communication frequency; its main goal is to improve the final convergence quality. Finally, a compression-based communication scheme further increases the communication efficiency of the training process. Our experimental results show that ASHL reduces overall training time by more than 35% to reach the same degree of convergence and has better generalization ability than state-of-the-art schemes such as ADSP.
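The abstract outlines three components: worker profiling, a two-stage hybrid training loop (asynchronous, then semi-asynchronous), and compressed communication. The sketch below is only an illustration of those ideas under our own simplifying assumptions; the function names, the staleness-bound mechanism, and the top-k compressor shown here are hypothetical stand-ins, not the authors' implementation.

```python
def profile_workers(speeds):
    # Profiling stage (illustrative): normalize measured worker
    # throughputs into capability weights a server could use to plan
    # per-worker training/communication loads.
    total = sum(speeds)
    return [s / total for s in speeds]

def train(speeds, coarse_rounds, fine_rounds, staleness_bound=2):
    # Two-stage hybrid training loop, tracked via model versions only
    # (no real gradients). Returns capability weights and the final
    # global model version.
    weights = profile_workers(speeds)
    version = 0
    worker_version = [0] * len(speeds)
    # Coarse stage (AHL-like): fully asynchronous; each worker pushes
    # its update as soon as it is ready, with no staleness check.
    for _ in range(coarse_rounds):
        for w in range(len(speeds)):
            version += 1
            worker_version[w] = version
    # Fine stage (SHL-like): semi-asynchronous; a worker whose local
    # model lags the global version by more than `staleness_bound`
    # must pull the latest model before pushing again.
    for _ in range(fine_rounds):
        for w in range(len(speeds)):
            if version - worker_version[w] > staleness_bound:
                worker_version[w] = version  # forced sync with server
            version += 1
            worker_version[w] = version
    return weights, version

def topk_compress(grad, k):
    # Compression-based communication (illustrative): transmit only the
    # k largest-magnitude gradient entries as (index, value) pairs.
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]),
                 reverse=True)[:k]
    return [(i, grad[i]) for i in sorted(idx)]
```

For example, `train([1.0, 3.0], coarse_rounds=2, fine_rounds=2)` advances the global model by one version per worker per round in both stages, and `topk_compress([0.1, -2.0, 0.5], 2)` keeps only the two largest-magnitude entries, cutting the payload a worker pushes to the server.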
Pages: 30-43 (14 pages)