SDPIPE: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training

Cited by: 3
Authors
Miao, Xupeng [1 ]
Shi, Yining [2 ]
Yang, Zhi [2 ]
Cui, Bin [2 ]
Jia, Zhihao [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Peking Univ, Beijing, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023, Vol. 16, No. 9
Funding
National Natural Science Foundation of China; National Key R&D Program of China
Keywords
ALGORITHMS;
DOI
10.14778/3598581.3598604
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
The increasing size of both deep learning models and training data necessitates scaling out model training through pipeline-parallel training, which combines pipelined model parallelism with data parallelism. However, most existing approaches assume an ideal, homogeneous, dedicated cluster. On real cloud clusters, they suffer from intensive model synchronization overheads caused by dynamic environment heterogeneity. This challenge leaves the design in a dilemma: centralized synchronization through a parameter server (PS) becomes a performance bottleneck, while decentralized synchronization (e.g., All-Reduce) suffers severe performance degradation from stragglers. This paper presents SDPIPE, a new semi-decentralized framework that gets the best of both worlds, achieving both high heterogeneity tolerance and convergence efficiency in pipeline-parallel training. To provide high performance, we decentralize the communication-heavy model synchronization, which accounts for the largest proportion of the synchronization overhead. In contrast, we centralize the process of group scheduling, which is lightweight but requires a global view to improve performance and convergence speed under heterogeneity. We show via a prototype implementation the significant advantages of SDPIPE in performance and scalability across different environments.
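To make the division of labor concrete, the minimal Python sketch below illustrates the semi-decentralized pattern the abstract describes: a central coordinator performs only the lightweight group scheduling, while the communication-heavy gradient averaging runs peer-to-peer within each group. The names (Coordinator, group_average) and the sort-by-speed grouping heuristic are illustrative assumptions, not SDPIPE's actual API or scheduling algorithm.

    import random

    class Coordinator:
        """Lightweight centralized role: tracks per-worker speed estimates
        and assigns workers to synchronization groups; it never moves
        model data itself."""

        def __init__(self, num_groups: int):
            self.num_groups = num_groups

        def schedule(self, speeds: dict) -> list:
            # Hypothetical heuristic: sort workers by observed speed and
            # cut the order into contiguous groups, so a straggler only
            # slows down its own group rather than the whole cluster.
            order = sorted(speeds, key=speeds.get, reverse=True)
            size = -(-len(order) // self.num_groups)  # ceiling division
            return [order[i:i + size] for i in range(0, len(order), size)]

    def group_average(gradients: list) -> list:
        """Heavyweight decentralized role: peers within one group average
        their gradients among themselves (a stand-in for an intra-group
        All-Reduce, with no central parameter server involved)."""
        n, dim = len(gradients), len(gradients[0])
        return [sum(g[d] for g in gradients) / n for d in range(dim)]

    if __name__ == "__main__":
        random.seed(0)
        # Simulated per-worker speed estimates the coordinator would
        # collect at runtime in a real deployment.
        speeds = {f"worker{i}": random.uniform(0.5, 2.0) for i in range(8)}
        coordinator = Coordinator(num_groups=2)

        for step in range(2):
            groups = coordinator.schedule(speeds)   # centralized, cheap
            for group in groups:
                grads = [[random.gauss(0.0, 1.0) for _ in range(4)]
                         for _ in group]
                avg = group_average(grads)          # decentralized, heavy
                print(f"step {step} group {group} -> avg grad {avg}")

The key design point the sketch captures is the asymmetry: scheduling decisions need a global view but exchange only a few bytes of metadata, whereas model synchronization moves gigabytes and therefore benefits from staying decentralized.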
Pages: 2354-2363
Page count: 10