Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning

Cited by: 0
Authors
Zhang, Lin [1 ]
Shi, Shaohuai [2 ]
Wang, Wei [1 ]
Li, Bo [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen 518055, Guangdong, Peoples R China
Keywords
Training; Computational modeling; Clustering algorithms; Graphics processing units; Memory management; Deep learning; Convergence; Distributed deep learning; K-FAC; performance optimization; second-order; natural gradient
DOI
10.1109/TCC.2022.3205918
CLC Classification
TP [Automation Technology, Computer Technology]
Subject Classification
0812
Abstract
Second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction in accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms require computing and communicating a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this article, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF construction tasks of different DNN layers to different workers. DP-KFAC not only retains the convergence property of the existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and a low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to the state-of-the-art D-KFAC methods.
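The core idea in the abstract can be illustrated with a small sketch: each layer's Kronecker factors are constructed and inverted by exactly one worker (so KFs are never communicated), and only the much smaller preconditioned gradients are then exchanged. The sketch below is illustrative only, using NumPy in place of a real GPU cluster; the function names (`assign_layers`, `kfac_precondition`) and the round-robin assignment are assumptions for illustration, not the paper's actual implementation, which the abstract does not specify in this detail.

```python
import numpy as np

def assign_layers(num_layers, num_workers):
    # Distributed preconditioning: layer l's Kronecker factors are
    # built and inverted only on worker l % num_workers (round-robin,
    # chosen here for illustration).
    return {l: l % num_workers for l in range(num_layers)}

def kfac_precondition(grad, acts, out_grads, damping=1e-3):
    """Precondition one linear layer's gradient with its K-FAC block.

    acts:      (batch, in_dim)  layer inputs a
    out_grads: (batch, out_dim) gradients w.r.t. layer outputs g
    grad:      (out_dim, in_dim) weight gradient dW
    """
    # Kronecker factors of the Fisher block: A = E[a a^T], G = E[g g^T].
    A = acts.T @ acts / acts.shape[0]
    G = out_grads.T @ out_grads / out_grads.shape[0]
    # Tikhonov damping keeps the factor inverses well conditioned.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # Preconditioned gradient: G^{-1} dW A^{-1}.
    return G_inv @ grad @ A_inv
```

In a real cluster, each worker would run `kfac_precondition` only for the layers it owns and then broadcast (or allgather) the resulting preconditioned gradients, which are the same size as the gradients themselves; the O(d^2) factors A and G never leave the worker that built them.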
Pages: 2365-2378
Page count: 14
Related Papers
50 total
  • [21] SoftMemoryBox II: A Scalable, Shared Memory Buffer Framework for Accelerating Distributed Training of Large-Scale Deep Neural Networks
    Ahn, Shinyoung
    Lim, Eunji
    [J]. IEEE ACCESS, 2020, 8 : 207097 - 207111
  • [22] Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks
    Tsuzuku, Yusuke
    Sato, Issei
    Sugiyama, Masashi
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [23] Scalable Quantitative Verification For Deep Neural Networks
    Baluta, Teodora
    Chua, Zheng Leong
    Meel, Kuldeep S.
    Saxena, Prateek
    [J]. 2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: COMPANION PROCEEDINGS (ICSE-COMPANION 2021), 2021, : 248 - 249
  • [24] Scalable Quantitative Verification For Deep Neural Networks
    Baluta, Teodora
    Chua, Zheng Leong
    Meel, Kuldeep S.
    Saxena, Prateek
    [J]. 2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2021), 2021, : 312 - 323
  • [25] A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks
    Barrachina, Sergio
    Castello, Adrian
    Catalan, Mar
    Dolz, Manuel F.
    Mestre, Jose I.
    [J]. 2021 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2021, : 730 - 739
  • [26] DLB: A Dynamic Load Balance Strategy for Distributed Training of Deep Neural Networks
    Ye, Qing
    Zhou, Yuhao
    Shi, Mingjia
    Sun, Yanan
    Lv, Jiancheng
    [J]. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2023, 7 (04): : 1217 - 1227
  • [27] Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
    Keuper, Janis
    Pfreundt, Franz-Josef
    [J]. PROCEEDINGS OF 2016 2ND WORKSHOP ON MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC), 2016, : 19 - 26
  • [28] Model-Aware Parallelization Strategy for Deep Neural Networks' Distributed Training
    Yang, Zhaoyi
    Dong, Fang
    [J]. 2019 SEVENTH INTERNATIONAL CONFERENCE ON ADVANCED CLOUD AND BIG DATA (CBD), 2019, : 61 - 66
  • [29] Distributed Repair of Deep Neural Networks
    Calsi, Davide Li
    Duran, Matias
    Zhang, Xiao-Yi
    Arcaini, Paolo
    Ishikawa, Fuyuki
    [J]. 2023 IEEE CONFERENCE ON SOFTWARE TESTING, VERIFICATION AND VALIDATION, ICST, 2023, : 83 - 94
  • [30] MULTILINGUAL TRAINING OF DEEP NEURAL NETWORKS
    Ghoshal, Arnab
    Swietojanski, Pawel
    Renals, Steve
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7319 - 7323