Performance Modeling and Analysis of Distributed Deep Neural Network Training with Parameter Server

Cited by: 0
Authors
Zhang, Xuan [1 ]
Zhang, Jiao [1 ,2 ]
Wei, Dehui [1 ]
Pan, Tian [1 ,2 ]
Huang, Tao [1 ,2 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[2] Purple Mt Labs, Nanjing, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
Distributed Training; Performance Modeling; Communication Optimization;
DOI
10.1109/GLOBECOM54140.2023.10436745
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Code
0808; 0809
Abstract
With the growth of dataset sizes and the development of hardware accelerators, deep neural networks (DNNs) have achieved great breakthroughs in many fields. To speed up DNN training, distributed training is widely used. However, the imbalance between computation and communication makes it difficult for distributed training to reach maximum efficiency, so there is a need to detect bottleneck states and to verify the effect of candidate optimization schemes; testing on a physical cluster incurs additional time and cost overhead. This paper builds a DNN-specific performance model that can be used for bottleneck detection and tuning at low cost. We construct the model through detailed analysis and reasonable assumptions, with a particular focus on fine-grained modeling of scalability and of network components, which are key factors affecting performance. We then validate the performance model on a testbed and an emulator, obtaining an average error of 5%. Finally, we present use cases of the performance model.
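To make the idea of such a performance model concrete, the sketch below shows a minimal analytical estimate of per-iteration time for data-parallel training with a parameter server, splitting the iteration into compute and push/pull communication. All function names, formulas, and numbers are illustrative assumptions, not the model proposed in the paper.

```python
# Hypothetical analytical model of one training iteration in a parameter-server
# setup. Everything here is an illustrative assumption, not the paper's model.

def iteration_time(model_bytes, flops_per_iter, gpu_flops, bandwidth_bytes,
                   num_workers, num_servers, overlap=0.0):
    """Estimate wall-clock time of one data-parallel training iteration.

    model_bytes      -- bytes of gradients/parameters exchanged per iteration
    flops_per_iter   -- forward + backward FLOPs per worker
    gpu_flops        -- sustained compute throughput of one worker (FLOP/s)
    bandwidth_bytes  -- per-link network bandwidth (bytes/s)
    num_workers      -- number of worker nodes
    num_servers      -- number of parameter-server nodes (model sharded evenly)
    overlap          -- fraction of communication hidden behind computation [0, 1]
    """
    # Computation: each worker processes its local mini-batch independently.
    t_compute = flops_per_iter / gpu_flops

    # Communication: every worker pushes gradients to and pulls parameters from
    # the servers; each server link is shared by all workers (incast bottleneck).
    shard_bytes = model_bytes / num_servers
    t_push = shard_bytes * num_workers / bandwidth_bytes
    t_pull = shard_bytes * num_workers / bandwidth_bytes
    t_comm = t_push + t_pull

    # Part of the communication may be overlapped with backpropagation.
    return t_compute + (1.0 - overlap) * t_comm


if __name__ == "__main__":
    # Example: ~100 MB of FP32 gradients, 8 workers, 4 servers, 10 Gbps links,
    # 1 TFLOP/s effective throughput per worker, 50% compute/comm overlap.
    t = iteration_time(model_bytes=100e6, flops_per_iter=8e12, gpu_flops=1e12,
                       bandwidth_bytes=1.25e9, num_workers=8, num_servers=4,
                       overlap=0.5)
    print(f"Estimated iteration time: {t:.3f} s")
```

A model of this shape can be swept over worker and server counts to see whether an iteration is compute- or communication-bound before committing to experiments on a physical cluster, which is the kind of low-cost bottleneck detection the abstract describes.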
Pages: 4140-4145
Number of pages: 6
Related Papers (50 records in total)
  • [1] Luo, Liang; Nelson, Jacob; Ceze, Luis; Phanishayee, Amar; Krishnamurthy, Arvind. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training. Proceedings of the 2018 ACM Symposium on Cloud Computing (SoCC '18), 2018: 41-54.
  • [2] Xian, Lintao; Li, Bingzhe; Liu, Jing; Guo, Zhongwen; Du, David H. C. H-PS: A Heterogeneous-Aware Parameter Server With Distributed Neural Network Training. IEEE Access, 2021, 9: 44049-44058.
  • [3] Benditkis, Daniel; Keren, Aviv; Mor-Yosef, Liron; Avidor, Tomer; Shoham, Neta; Tal-Israel, Nadav. Distributed Deep Neural Network Training on Edge Devices. SEC '19: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 2019: 304-306.
  • [4] Castello, Adrian; Catalan, Mar; Dolz, Manuel F.; Mestre, Jose I.; Quintana-Orti, Enrique S.; Duato, Jose. Performance Modeling for Distributed Training of Convolutional Neural Networks. 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2021), 2021: 99-108.
  • [5] Pauloski, J. Gregory; Huang, Lei; Xu, Weijia; Chard, Kyle; Foster, Ian T.; Zhang, Zhao. Deep Neural Network Training With Distributed K-FAC. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(12): 3616-3627.
  • [6] Lu, XinJiang; Cui, Xiangbo. A Spatiotemporal Neural Network Modeling Method for Nonlinear Distributed Parameter Systems. IEEE Transactions on Industrial Informatics, 2021, 17(3): 1916-1926.
  • [7] Xiao, Qi; Tang, Min; Liu, Zhiyuan; Mao, Junfa. Distributed Parameter Modeling for Coupled Striplines Based on Artificial Neural Network. 2022 IEEE 10th Asia-Pacific Conference on Antennas and Propagation (APCAP), 2022.
  • [8] Liu, Ting; Miao, Tianhao; Wu, Qinghua; Li, Zhenyu; He, Guangxin; Wu, Jiaoren; Zhang, Shengzhuo; Yang, Xingwu; Tyson, Gareth; Xie, Gaogang. Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training. Proceedings of the ACM Web Conference 2022 (WWW '22), 2022: 1764-1773.
  • [9] Castello, Adrian; Quintana-Orti, Enrique S.; Duato, Jose. Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing, 2021, 24(4): 3797-3813.