Performance Modeling and Analysis of Distributed Deep Neural Network Training with Parameter Server

Cited by: 0
Authors
Zhang, Xuan [1 ]
Zhang, Jiao [1 ,2 ]
Wei, Dehui [1 ]
Pan, Tian [1 ,2 ]
Huang, Tao [1 ,2 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[2] Purple Mt Labs, Nanjing, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
Distributed Training; Performance Modeling; Communication Optimization;
DOI
10.1109/GLOBECOM54140.2023.10436745
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Code
0808; 0809
Abstract
With the growth of dataset sizes and the development of hardware accelerators, deep neural networks (DNNs) have achieved great breakthroughs in many fields. To speed up DNN training, distributed training is widely used. However, the imbalance between computation and communication makes it difficult for distributed training to reach maximum efficiency, so there is a need to detect bottleneck states and to verify the effect of candidate optimization schemes; testing on a physical cluster incurs additional time and cost overhead. This paper builds a DNN-specific performance model that can be used for bottleneck detection and tuning at low cost. We construct the model through detailed analysis and reasonable assumptions, with a particular focus on fine-grained modeling of scalability and of network components, which are key factors affecting performance. We then validate the performance model on a testbed and an emulator, obtaining an average error of 5%. Finally, we present use cases of the performance model.
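To make the idea of such a performance model concrete, the sketch below shows a minimal analytical estimate of per-iteration time for data-parallel training with a parameter server, splitting the iteration into compute and push/pull communication. All function names, formulas, and numbers are illustrative assumptions, not the model proposed in the paper.

```python
# Hypothetical analytical model of one training iteration in a parameter-server
# setup. Everything here is an illustrative assumption, not the paper's model.

def iteration_time(model_bytes, flops_per_iter, gpu_flops, bandwidth_bytes,
                   num_workers, num_servers, overlap=0.0):
    """Estimate wall-clock time of one data-parallel training iteration.

    model_bytes      -- bytes of gradients/parameters exchanged per iteration
    flops_per_iter   -- forward + backward FLOPs per worker
    gpu_flops        -- sustained compute throughput of one worker (FLOP/s)
    bandwidth_bytes  -- per-link network bandwidth (bytes/s)
    num_workers      -- number of worker nodes
    num_servers      -- number of parameter-server nodes (model sharded evenly)
    overlap          -- fraction of communication hidden behind computation [0, 1]
    """
    # Computation: each worker processes its local mini-batch independently.
    t_compute = flops_per_iter / gpu_flops

    # Communication: every worker pushes gradients to and pulls parameters from
    # the servers; each server link is shared by all workers (incast bottleneck).
    shard_bytes = model_bytes / num_servers
    t_push = shard_bytes * num_workers / bandwidth_bytes
    t_pull = shard_bytes * num_workers / bandwidth_bytes
    t_comm = t_push + t_pull

    # Part of the communication may be overlapped with backpropagation.
    return t_compute + (1.0 - overlap) * t_comm


if __name__ == "__main__":
    # Example: ~100 MB of FP32 gradients, 8 workers, 4 servers, 10 Gbps links,
    # 1 TFLOP/s effective throughput per worker, 50% compute/comm overlap.
    t = iteration_time(model_bytes=100e6, flops_per_iter=8e12, gpu_flops=1e12,
                       bandwidth_bytes=1.25e9, num_workers=8, num_servers=4,
                       overlap=0.5)
    print(f"Estimated iteration time: {t:.3f} s")
```

A model of this shape can be swept over worker and server counts to see whether an iteration is compute- or communication-bound before committing to experiments on a physical cluster, which is the kind of low-cost bottleneck detection the abstract describes.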
Pages: 4140-4145
Number of pages: 6
Related Papers (50 records in total)
  • [1] Luo, Liang; Nelson, Jacob; Ceze, Luis; Phanishayee, Amar; Krishnamurthy, Arvind. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training. Proceedings of the 2018 ACM Symposium on Cloud Computing (SoCC '18), 2018: 41-54.
  • [2] Xian, Lintao; Li, Bingzhe; Liu, Jing; Guo, Zhongwen; Du, David H. C. H-PS: A Heterogeneous-Aware Parameter Server With Distributed Neural Network Training. IEEE Access, 2021, 9: 44049-44058.
  • [3] Benditkis, Daniel; Keren, Aviv; Mor-Yosef, Liron; Avidor, Tomer; Shoham, Neta; Tal-Israel, Nadav. Distributed Deep Neural Network Training on Edge Devices. SEC '19: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 2019: 304-306.
  • [4] Castello, Adrian; Catalan, Mar; Dolz, Manuel F.; Mestre, Jose I.; Quintana-Orti, Enrique S.; Duato, Jose. Performance Modeling for Distributed Training of Convolutional Neural Networks. 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2021), 2021: 99-108.
  • [5] Pauloski, J. Gregory; Huang, Lei; Xu, Weijia; Chard, Kyle; Foster, Ian T.; Zhang, Zhao. Deep Neural Network Training With Distributed K-FAC. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(12): 3616-3627.
  • [6] Lu, XinJiang; Cui, Xiangbo. A Spatiotemporal Neural Network Modeling Method for Nonlinear Distributed Parameter Systems. IEEE Transactions on Industrial Informatics, 2021, 17(3): 1916-1926.
  • [7] Xiao, Qi; Tang, Min; Liu, Zhiyuan; Mao, Junfa. Distributed Parameter Modeling for Coupled Striplines Based on Artificial Neural Network. 2022 IEEE 10th Asia-Pacific Conference on Antennas and Propagation (APCAP), 2022.
  • [8] Liu, Ting; Miao, Tianhao; Wu, Qinghua; Li, Zhenyu; He, Guangxin; Wu, Jiaoren; Zhang, Shengzhuo; Yang, Xingwu; Tyson, Gareth; Xie, Gaogang. Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training. Proceedings of the ACM Web Conference 2022 (WWW '22), 2022: 1764-1773.
  • [9] Castello, Adrian; Quintana-Orti, Enrique S.; Duato, Jose. Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing, 2021, 24(4): 3797-3813.