Alleviating Imbalance in Synchronous Distributed Training of Deep Neural Networks

Cited: 0
Authors
Lin, Haiyang [1 ,2 ]
Yan, Mingyu [1 ,2 ]
Wang, Duo [1 ,2 ]
Li, Wenming [1 ,2 ]
Ye, Xiaochun [1 ,2 ,3 ]
Tang, Zhimin [1 ,2 ]
Fan, Dongrui [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, State Key Lab Comp Architecture, ICT, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] State Key Lab Math Engn & Adv Comp, Wuxi, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
deep neural networks; synchronous distributed training; imbalance; dynamic workload strategy; link strategy;
DOI
10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00063
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Deep neural networks have achieved huge success in computer vision, natural language processing and other fields. As larger networks and larger datasets bring better performance, training time grows rapidly. Distributed training uses parallelization strategies to accelerate the training of deep neural networks. Among existing distributed training frameworks for deep neural networks, the parameter server architecture has poor scalability and is not suitable for large-scale distributed training. The Ring/HD AllReduce algorithms scale well under ideal circumstances. However, in many realistic scenarios imbalance occurs frequently in the distributed training system, mainly imbalance in compute capability and in the network. Due to the structural characteristics of synchronous distributed training and the AllReduce algorithm, imbalance has a serious impact on performance. At present, there is a lack of research addressing both compute-capability imbalance and network imbalance, especially dynamic imbalance. To alleviate this problem, we first analyze the AllReduce algorithm and then propose RingDL, which optimizes the performance of the distributed training system by alleviating imbalance through a customizable dynamic workload strategy and link strategy. The experimental results show that RingDL significantly reduces idle time in both the computation and synchronization phases, improving overall performance.
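The abstract describes two effects in prose: ring-style AllReduce synchronization and the idle time introduced when worker compute capability is imbalanced, which a dynamic workload strategy tries to reduce. The short Python sketch below is not taken from RingDL; it is a minimal, hypothetical model (all function names, speeds, and batch sizes are invented for illustration) showing how a straggler dominates the synchronous step time under an even batch split, and how a speed-proportional re-split shrinks per-step idle time.

# Minimal sketch (not the authors' code): why a straggler hurts synchronous
# training with ring all-reduce, and how re-splitting the global batch by
# measured compute speed can reduce per-step idle time. All names and numbers
# (worker speeds, batch size, comm cost) are hypothetical.

def ring_allreduce_steps(num_workers: int) -> int:
    """A ring all-reduce over N workers takes 2*(N-1) communication steps:
    (N-1) reduce-scatter steps followed by (N-1) all-gather steps."""
    return 2 * (num_workers - 1)

def step_time(batch_shares, worker_speeds, comm_time):
    """Synchronous step time = slowest worker's compute time + all-reduce time."""
    compute_times = [share / speed for share, speed in zip(batch_shares, worker_speeds)]
    return max(compute_times) + comm_time, compute_times

def rebalance_batch(global_batch, worker_speeds):
    """Assign each worker a share proportional to its measured speed
    (a simple stand-in for a dynamic workload strategy)."""
    total = sum(worker_speeds)
    return [global_batch * s / total for s in worker_speeds]

if __name__ == "__main__":
    global_batch = 1024.0
    speeds = [1.0, 1.0, 1.0, 0.5]   # the last worker is a straggler
    comm = 0.05                      # fixed all-reduce cost per step

    even = [global_batch / len(speeds)] * len(speeds)
    t_even, ct_even = step_time(even, speeds, comm)
    balanced = rebalance_batch(global_batch, speeds)
    t_bal, ct_bal = step_time(balanced, speeds, comm)

    print("ring all-reduce steps:", ring_allreduce_steps(len(speeds)))
    print(f"even split : step time {t_even:.1f}, idle {max(ct_even) - min(ct_even):.1f}")
    print(f"speed-aware: step time {t_bal:.1f}, idle {max(ct_bal) - min(ct_bal):.1f}")

Running the sketch prints the 2*(N-1) ring step count and compares the even split with the speed-aware split; with one worker at half speed, the even split leaves the fast workers idle for half of every step, while the proportional split removes that idle time in this simplified model.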
Pages: 405 - 412
Number of pages: 8