SDPIPE: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training

Cited by: 3
Authors
Miao, Xupeng [1]
Shi, Yining [2]
Yang, Zhi [2]
Cui, Bin [2]
Jia, Zhihao [1]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Peking Univ, Beijing, Peoples R China
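Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Vol. 16 / No. 09
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords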
ALGORITHMS;
DOI
10.14778/3598581.3598604
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
The increasing size of both deep learning models and training data necessitates the ability to scale out model training through pipeline-parallel training, which combines pipelined model parallelism and data parallelism. However, most of them assume an ideal homogeneous dedicated cluster. As for real cloud clusters, these approaches suffer from the intensive model synchronization overheads due to the dynamic environment heterogeneity. Such a huge challenge leaves the design in a dilemma: either the performance bottleneck of the central parameter server (PS) or severe performance degradation caused by stragglers for decentralized synchronization (like All-Reduce). This approach presents SDPIPE, a new semi-decentralized framework to get the best of both worlds, achieving both high heterogeneity tolerance and convergence efficiency in pipeline-parallel training. To provide high performance, we decentralize the communication model synchronization, which accounts for the largest proportion of synchronization overhead. In contrast, we centralize the process of group scheduling, which is lightweight but needs a global view for better performance and convergence speed against heterogeneity. We show via a prototype implementation the significant advantage of SDP... on performance and scalability, facing different environments.
Pages: 2354 - 2363
Number of pages: 10
Related papers
31 in total
  • [11] Semi-Decentralized Interference Aware Scheduling in D2D-Enabled Cellular Networks
    Kluegel, Markus
    Kellerer, Wolfgang
    IEEE ACCESS, 2020, 8 : 132285 - 132301
  • [12] mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training
    Dreuning, Henk
    Bal, Henri E.
    van Nieuwpoort, Rob V.
    EURO-PAR 2022: PARALLEL PROCESSING, 2022, 13440 : 155 - 170
  • [13] Boosting Parallel File System Performance via Heterogeneity-Aware Selective Data Layout
    He, Shuibing
    Wang, Yang
    Sun, Xian-He
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2016, 27 (09) : 2492 - 2505
  • [14] A Holistic Heterogeneity-Aware Data Placement Scheme for Hybrid Parallel I/O Systems
    He, Shuibing
    Li, Zheng
    Zhou, Jiang
    Yin, Yanlong
    Xu, Xiaohua
    Chen, Yong
    Sun, Xian-He
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (04) : 830 - 842
  • [15] HAS: Heterogeneity-Aware Selective Layout Scheme for Parallel File Systems on Hybrid Servers
    He, Shuibing
    Sun, Xian-He
    Haider, Adnan
    2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2015, : 613 - 622
  • [16] CHRT: a Criticality- and Heterogeneity-Aware Runtime System for Task-Parallel Applications
    Han, Myeonggyun
    Park, Jinsu
    Baek, Woongki
    PROCEEDINGS OF THE 2017 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), 2017, : 942 - 945
  • [17] FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Approach for Heterogeneous Edge Devices
    Chen, Yuhao
    Yang, Qianqian
    He, Shibo
    Shi, Zhiguo
    Chen, Jiming
    Guizani, Mohsen
    IEEE TRANSACTIONS ON MOBILE COMPUTING, 2024, 23 (04) : 3200 - 3212
  • [18] A Heterogeneity-Aware Region-Level Data Layout for Hybrid Parallel File Systems
    He, Shuibing
    Sun, Xian-He
    Wang, Yang
    Kougkas, Antonis
    Haider, Adnan
    2015 44TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP), 2015, : 340 - 349
  • [19] HARL: Optimizing Parallel File Systems with Heterogeneity-Aware Region-Level Data Layout
    He, Shuibing
    Wang, Yang
    Sun, Xian-He
    Xu, Chengzhong
    IEEE TRANSACTIONS ON COMPUTERS, 2017, 66 (06) : 1048 - 1060
  • [20] Design and Implementation of a Criticality- and Heterogeneity-Aware Runtime System for Task-Parallel Applications
    Han, Myeonggyun
    Park, Jinsu
    Baek, Woongki
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (05) : 1117 - 1132