Survey on Network of Distributed Deep Learning Training

Cited by: 0
Authors
Zhu H. [1 ,2 ]
Yuan G. [1 ]
Yao C. [3 ]
Tan G. [1 ]
Wang Z. [1 ]
Hu Z. [1 ,2 ,3 ]
Zhang X. [1 ,2 ,3 ]
An X. [1 ]
Affiliations
[1] Institute of Computing Technology, Chinese Academy of Sciences, Beijing
[2] University of Chinese Academy of Sciences, Beijing
[3] Megvii Inc., Beijing
Funding
National Natural Science Foundation of China
Keywords
Cluster network; Collective communication; Communication network; Deep learning; Distributed computing; Performance optimization
DOI
10.7544/issn1000-1239.2021.20190881
Abstract
In recent years, deep learning has achieved better results than traditional algorithms in many fields such as image, speech, and natural language processing, and the demands on training speed and data processing capability keep growing. However, the computing power of a single server is limited and cannot meet these demands, so distributed deep learning training has become the most effective way to scale up training capacity. At present, distributed training is bottlenecked by communication over the network during the training process, which makes the communication network the most influential factor in overall performance, and many studies now target network performance optimization for distributed deep learning. In this paper, the main performance bottlenecks and optimization schemes are first presented. Then the state-of-the-art ultra-large-scale distributed training architectures and performance optimization methods are analyzed in detail. Finally, a comparative summary of the performance optimization schemes and the remaining difficulties in distributed deep learning training is given, and future research directions are pointed out. © 2021, Science Press. All rights reserved.
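To make concrete where the communication that the abstract describes arises, the following minimal sketch (not taken from the paper or the surveyed systems) shows synchronous data-parallel training in PyTorch: each backward pass triggers an all-reduce collective that averages gradients across workers, and this collective traffic is what the cluster network must carry. The model, data loader, and process-launch details (rank, world size, rendezvous environment variables) are assumed to be supplied by the caller.

# Hypothetical sketch of synchronous data-parallel training with all-reduce
# gradient synchronization; assumes PyTorch with the NCCL backend and that
# MASTER_ADDR/MASTER_PORT are set for process-group rendezvous.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size, model, loader, epochs=1):
    # One process per GPU; all ranks join a single process group.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = model.to(rank)
    ddp_model = DDP(model, device_ids=[rank])   # hooks all-reduce into backward()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    for _ in range(epochs):
        for inputs, targets in loader:
            inputs, targets = inputs.to(rank), targets.to(rank)
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
            loss.backward()    # gradients are all-reduced across workers here
            optimizer.step()   # every worker applies the same averaged update
    dist.destroy_process_group()

In this scheme the per-step communication volume grows with the model size and the number of workers, which is why the survey focuses on the cluster network and collective-communication optimizations.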
Pages: 98-115
Page count: 17