Alleviating Imbalance in Synchronous Distributed Training of Deep Neural Networks

Cited: 0
Authors
Lin, Haiyang [1 ,2 ]
Yan, Mingyu [1 ,2 ]
Wang, Duo [1 ,2 ]
Li, Wenming [1 ,2 ]
Ye, Xiaochun [1 ,2 ,3 ]
Tang, Zhimin [1 ,2 ]
Fan, Dongrui [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, State Key Lab Comp Architecture, ICT, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] State Key Lab Math Engn & Adv Comp, Wuxi, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
deep neural networks; synchronous distributed training; imbalance; dynamic workload strategy; link strategy;
DOI
10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00063
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Deep neural networks have achieved huge success in computer vision, natural language processing and other fields. As larger networks and larger datasets bring better performance, training time grows rapidly. Distributed training uses parallelization strategies to accelerate the training of deep neural networks. Among existing distributed training frameworks for deep neural networks, the parameter server architecture has poor scalability and is not suitable for large-scale distributed training. The Ring/HD AllReduce algorithms scale well under ideal circumstances. However, in many realistic scenarios imbalance occurs frequently in the distributed training system, mainly imbalance in compute capability and in the network. Due to the structural characteristics of synchronous distributed training and the AllReduce algorithm, imbalance has a serious impact on performance. At present, there is a lack of research addressing both compute-capability imbalance and network imbalance, especially dynamic imbalance. To alleviate this problem, we first analyze the AllReduce algorithm and then propose RingDL, which optimizes the performance of the distributed training system by alleviating imbalance through a customizable dynamic workload strategy and link strategy. The experimental results show that RingDL significantly reduces idle time in both the computation and synchronization phases, improving overall performance.
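The abstract describes two effects in prose: ring-style AllReduce synchronization and the idle time introduced when worker compute capability is imbalanced, which a dynamic workload strategy tries to reduce. The short Python sketch below is not taken from RingDL; it is a minimal, hypothetical model (all function names, speeds, and batch sizes are invented for illustration) showing how a straggler dominates the synchronous step time under an even batch split, and how a speed-proportional re-split shrinks per-step idle time.

# Minimal sketch (not the authors' code): why a straggler hurts synchronous
# training with ring all-reduce, and how re-splitting the global batch by
# measured compute speed can reduce per-step idle time. All names and numbers
# (worker speeds, batch size, comm cost) are hypothetical.

def ring_allreduce_steps(num_workers: int) -> int:
    """A ring all-reduce over N workers takes 2*(N-1) communication steps:
    (N-1) reduce-scatter steps followed by (N-1) all-gather steps."""
    return 2 * (num_workers - 1)

def step_time(batch_shares, worker_speeds, comm_time):
    """Synchronous step time = slowest worker's compute time + all-reduce time."""
    compute_times = [share / speed for share, speed in zip(batch_shares, worker_speeds)]
    return max(compute_times) + comm_time, compute_times

def rebalance_batch(global_batch, worker_speeds):
    """Assign each worker a share proportional to its measured speed
    (a simple stand-in for a dynamic workload strategy)."""
    total = sum(worker_speeds)
    return [global_batch * s / total for s in worker_speeds]

if __name__ == "__main__":
    global_batch = 1024.0
    speeds = [1.0, 1.0, 1.0, 0.5]   # the last worker is a straggler
    comm = 0.05                      # fixed all-reduce cost per step

    even = [global_batch / len(speeds)] * len(speeds)
    t_even, ct_even = step_time(even, speeds, comm)
    balanced = rebalance_batch(global_batch, speeds)
    t_bal, ct_bal = step_time(balanced, speeds, comm)

    print("ring all-reduce steps:", ring_allreduce_steps(len(speeds)))
    print(f"even split : step time {t_even:.1f}, idle {max(ct_even) - min(ct_even):.1f}")
    print(f"speed-aware: step time {t_bal:.1f}, idle {max(ct_bal) - min(ct_bal):.1f}")

Running the sketch prints the 2*(N-1) ring step count and compares the even split with the speed-aware split; with one worker at half speed, the even split leaves the fast workers idle for half of every step, while the proportional split removes that idle time in this simplified model.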
Pages: 405 - 412
Number of pages: 8