SDPIPE: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training

Cited by: 3
Authors
Miao, Xupeng [1 ]
Shi, Yining [2 ]
Yang, Zhi [2 ]
Cui, Bin [2 ]
Jia, Zhihao [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Peking Univ, Beijing, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023, Vol. 16, No. 9
Funding
National Natural Science Foundation of China; National Key R&D Program of China
Keywords
ALGORITHMS;
DOI
10.14778/3598581.3598604
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
The increasing size of both deep learning models and training data necessitates scaling out model training through pipeline-parallel training, which combines pipelined model parallelism with data parallelism. However, most existing approaches assume an ideal, homogeneous, dedicated cluster. On real cloud clusters, they suffer from intensive model synchronization overheads caused by dynamic environment heterogeneity. This challenge leaves the design in a dilemma: centralized synchronization through a parameter server (PS) becomes a performance bottleneck, while decentralized synchronization (e.g., All-Reduce) suffers severe performance degradation from stragglers. This paper presents SDPIPE, a new semi-decentralized framework that gets the best of both worlds, achieving both high heterogeneity tolerance and convergence efficiency in pipeline-parallel training. To provide high performance, we decentralize the communication-heavy model synchronization, which accounts for the largest proportion of the synchronization overhead. In contrast, we centralize the process of group scheduling, which is lightweight but requires a global view to improve performance and convergence speed under heterogeneity. We show via a prototype implementation the significant advantages of SDPIPE in performance and scalability across different environments.
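To make the division of labor concrete, the minimal Python sketch below illustrates the semi-decentralized pattern the abstract describes: a central coordinator performs only the lightweight group scheduling, while the communication-heavy gradient averaging runs peer-to-peer within each group. The names (Coordinator, group_average) and the sort-by-speed grouping heuristic are illustrative assumptions, not SDPIPE's actual API or scheduling algorithm.

    import random

    class Coordinator:
        """Lightweight centralized role: tracks per-worker speed estimates
        and assigns workers to synchronization groups; it never moves
        model data itself."""

        def __init__(self, num_groups: int):
            self.num_groups = num_groups

        def schedule(self, speeds: dict) -> list:
            # Hypothetical heuristic: sort workers by observed speed and
            # cut the order into contiguous groups, so a straggler only
            # slows down its own group rather than the whole cluster.
            order = sorted(speeds, key=speeds.get, reverse=True)
            size = -(-len(order) // self.num_groups)  # ceiling division
            return [order[i:i + size] for i in range(0, len(order), size)]

    def group_average(gradients: list) -> list:
        """Heavyweight decentralized role: peers within one group average
        their gradients among themselves (a stand-in for an intra-group
        All-Reduce, with no central parameter server involved)."""
        n, dim = len(gradients), len(gradients[0])
        return [sum(g[d] for g in gradients) / n for d in range(dim)]

    if __name__ == "__main__":
        random.seed(0)
        # Simulated per-worker speed estimates the coordinator would
        # collect at runtime in a real deployment.
        speeds = {f"worker{i}": random.uniform(0.5, 2.0) for i in range(8)}
        coordinator = Coordinator(num_groups=2)

        for step in range(2):
            groups = coordinator.schedule(speeds)   # centralized, cheap
            for group in groups:
                grads = [[random.gauss(0.0, 1.0) for _ in range(4)]
                         for _ in group]
                avg = group_average(grads)          # decentralized, heavy
                print(f"step {step} group {group} -> avg grad {avg}")

The key design point the sketch captures is the asymmetry: scheduling decisions need a global view but exchange only a few bytes of metadata, whereas model synchronization moves gigabytes and therefore benefits from staying decentralized.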
Pages: 2354-2363
Page count: 10