Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning

Cited by: 0
Authors
Zhang, Lin [1 ]
Shi, Shaohuai [2 ]
Wang, Wei [1 ]
Li, Bo [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen 518055, Guangdong, Peoples R China
Keywords
Training; Computational modeling; Clustering algorithms; Graphics processing units; Memory management; Deep learning; Convergence; Distributed deep learning; K-FAC; performance optimization; second-order; natural gradient
DOI
10.1109/TCC.2022.3205918
CLC Classification
TP [Automation Technology, Computer Technology]
Subject Classification
0812
Abstract
Second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction in accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms require computing and communicating a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this article, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF construction tasks of different DNN layers to different workers. DP-KFAC not only retains the convergence property of the existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and a low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to the state-of-the-art D-KFAC methods.
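The core idea in the abstract can be illustrated with a small sketch: each layer's Kronecker factors are constructed and inverted by exactly one worker (so KFs are never communicated), and only the much smaller preconditioned gradients are then exchanged. The sketch below is illustrative only, using NumPy in place of a real GPU cluster; the function names (`assign_layers`, `kfac_precondition`) and the round-robin assignment are assumptions for illustration, not the paper's actual implementation, which the abstract does not specify in this detail.

```python
import numpy as np

def assign_layers(num_layers, num_workers):
    # Distributed preconditioning: layer l's Kronecker factors are
    # built and inverted only on worker l % num_workers (round-robin,
    # chosen here for illustration).
    return {l: l % num_workers for l in range(num_layers)}

def kfac_precondition(grad, acts, out_grads, damping=1e-3):
    """Precondition one linear layer's gradient with its K-FAC block.

    acts:      (batch, in_dim)  layer inputs a
    out_grads: (batch, out_dim) gradients w.r.t. layer outputs g
    grad:      (out_dim, in_dim) weight gradient dW
    """
    # Kronecker factors of the Fisher block: A = E[a a^T], G = E[g g^T].
    A = acts.T @ acts / acts.shape[0]
    G = out_grads.T @ out_grads / out_grads.shape[0]
    # Tikhonov damping keeps the factor inverses well conditioned.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # Preconditioned gradient: G^{-1} dW A^{-1}.
    return G_inv @ grad @ A_inv
```

In a real cluster, each worker would run `kfac_precondition` only for the layers it owns and then broadcast (or allgather) the resulting preconditioned gradients, which are the same size as the gradients themselves; the O(d^2) factors A and G never leave the worker that built them.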
Pages: 2365-2378
Page count: 14
Related Papers
50 total
  • [21] SoftMemoryBox II: A Scalable, Shared Memory Buffer Framework for Accelerating Distributed Training of Large-Scale Deep Neural Networks
    Ahn, Shinyoung
    Lim, Eunji
    [J]. IEEE ACCESS, 2020, 8 : 207097 - 207111
  • [22] Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks
    Tsuzuku, Yusuke
    Sato, Issei
    Sugiyama, Masashi
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [23] Scalable Quantitative Verification For Deep Neural Networks
    Baluta, Teodora
    Chua, Zheng Leong
    Meel, Kuldeep S.
    Saxena, Prateek
    [J]. 2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: COMPANION PROCEEDINGS (ICSE-COMPANION 2021), 2021, : 248 - 249
  • [24] Scalable Quantitative Verification For Deep Neural Networks
    Baluta, Teodora
    Chua, Zheng Leong
    Meel, Kuldeep S.
    Saxena, Prateek
    [J]. 2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2021), 2021, : 312 - 323
  • [25] A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks
    Barrachina, Sergio
    Castello, Adrian
    Catalan, Mar
    Dolz, Manuel F.
    Mestre, Jose I.
    [J]. 2021 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2021, : 730 - 739
  • [26] DLB: A Dynamic Load Balance Strategy for Distributed Training of Deep Neural Networks
    Ye, Qing
    Zhou, Yuhao
    Shi, Mingjia
    Sun, Yanan
    Lv, Jiancheng
    [J]. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2023, 7 (04): : 1217 - 1227
  • [27] Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
    Keuper, Janis
    Pfreundt, Franz-Josef
    [J]. PROCEEDINGS OF 2016 2ND WORKSHOP ON MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC), 2016, : 19 - 26
  • [28] Model-Aware Parallelization Strategy for Deep Neural Networks' Distributed Training
    Yang, Zhaoyi
    Dong, Fang
    [J]. 2019 SEVENTH INTERNATIONAL CONFERENCE ON ADVANCED CLOUD AND BIG DATA (CBD), 2019, : 61 - 66
  • [29] Distributed Repair of Deep Neural Networks
    Calsi, Davide Li
    Duran, Matias
    Zhang, Xiao-Yi
    Arcaini, Paolo
    Ishikawa, Fuyuki
    [J]. 2023 IEEE CONFERENCE ON SOFTWARE TESTING, VERIFICATION AND VALIDATION, ICST, 2023, : 83 - 94
  • [30] MULTILINGUAL TRAINING OF DEEP NEURAL NETWORKS
    Ghoshal, Arnab
    Swietojanski, Pawel
    Renals, Steve
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7319 - 7323