Deep Neural Network Training With Distributed K-FAC

Cited by: 0
Authors
Pauloski, J. Gregory [1 ]
Huang, Lei [2 ]
Xu, Weijia [2 ]
Chard, Kyle [1 ]
Foster, Ian T. [1 ]
Zhang, Zhao [2 ]
Affiliations
[1] Univ Chicago, Dept Comp Sci, Chicago, IL 60637 USA
[2] Texas Adv Comp Ctr, Austin, TX 78758 USA
Keywords
Training; Parallel processing; Program processors; Convergence; Computational modeling; Data models; Deep learning; Optimization methods; Neural networks; Scalability; High-performance computing; Optimization
DOI
10.1109/TPDS.2022.3161187
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scales have enabled natural gradient optimization methods as a reasonable alternative to stochastic gradient descent and variants thereof. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, preconditions gradients with an efficient approximation of the Fisher Information Matrix to improve per-iteration progress when optimizing an objective function. Here we propose a scalable K-FAC algorithm and investigate K-FAC's applicability in large-scale deep neural network training. Specifically, we explore layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling, with the goal of preserving convergence while minimizing training time. We evaluate the convergence and scaling properties of our K-FAC gradient preconditioner, for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9-25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
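To make the preconditioning idea concrete, the sketch below is a minimal, single-layer, single-process NumPy illustration of Kronecker-factored gradient preconditioning as summarized in the abstract. It is a sketch under stated assumptions, not the paper's distributed, inverse-free implementation; the function name kfac_precondition, the damping value, and the array shapes are chosen purely for illustration.

```python
import numpy as np

# Minimal single-layer sketch of K-FAC preconditioning (illustrative only; the
# paper's distributed, inverse-free variant is not reproduced here).
# For a fully connected layer with weights W (out x in), K-FAC approximates the
# layer's Fisher block as A ⊗ G, where
#   A = E[a a^T]  -- covariance of the layer's inputs a
#   G = E[g g^T]  -- covariance of the gradients g w.r.t. the layer's outputs,
# so preconditioning reduces to G^{-1} dW A^{-1} instead of working with the
# full Fisher matrix.

def kfac_precondition(dW, a_batch, g_batch, damping=1e-3):
    """Hypothetical helper: precondition the gradient dW of one linear layer.

    dW      : (out, in)  gradient of the loss w.r.t. the weights
    a_batch : (batch, in)  layer inputs
    g_batch : (batch, out) gradients of the loss w.r.t. the layer outputs
    damping : Tikhonov damping added to both Kronecker factors
    """
    n = a_batch.shape[0]
    A = a_batch.T @ a_batch / n + damping * np.eye(a_batch.shape[1])
    G = g_batch.T @ g_batch / n + damping * np.eye(g_batch.shape[1])
    # (A ⊗ G)^{-1} vec(dW) == vec(G^{-1} dW A^{-1}); only the small per-layer
    # factors are ever formed or inverted.
    return np.linalg.solve(G, dW) @ np.linalg.inv(A)

# Toy usage with random data standing in for a real forward/backward pass.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 128))    # batch of 32 inputs to a 128 -> 64 layer
g = rng.standard_normal((32, 64))     # matching output gradients
dW = g.T @ a / 32                     # plain first-order gradient
dW_nat = kfac_precondition(dW, a, g)  # K-FAC-preconditioned gradient
```

How the per-layer factor work is placed on workers, and whether explicit inversion is avoided altogether, is exactly the design space the abstract describes (layer-wise distribution, inverse-free second-order evaluation, and decoupled K-FAC updates).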
Pages: 3616-3627
Page count: 12
Related Papers
50 records in total
  • [1] Convolutional Neural Network Training with Distributed K-FAC
    Pauloski, J. Gregory
    Zhang, Zhao
    Huang, Lei
    Xu, Weijia
    Foster, Ian T.
    [J]. PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20), 2020,
  • [2] Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning
    Zhang, Lin
    Shi, Shaohuai
    Wang, Wei
    Li, Bo
    [J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2023, 11 (03) : 2365 - 2378
  • [3] Inefficiency of K-FAC for Large Batch Size Training
    Ma, Linjian
    Montague, Gabe
    Ye, Jiayu
    Yao, Zhewei
    Gholami, Amir
    Keutzer, Kurt
    Mahoney, Michael W.
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 5053 - 5060
  • [4] Optimizing Q-Learning with K-FAC Algorithm
    Beltiukov, Roman
    [J]. ANALYSIS OF IMAGES, SOCIAL NETWORKS AND TEXTS (AIST 2019), 2020, 1086 : 3 - 8
  • [5] Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks
    Shi, Shaohuai
    Zhang, Lin
    Li, Bo
    [J]. 2021 IEEE 41ST INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2021), 2021, : 550 - 560
  • [6] Distributed Deep Neural Network Training on Edge Devices
    Benditkis, Daniel
    Keren, Aviv
    Mor-Yosef, Liron
    Avidor, Tomer
    Shoham, Neta
    Tal-Israel, Nadav
    [J]. SEC'19: PROCEEDINGS OF THE 4TH ACM/IEEE SYMPOSIUM ON EDGE COMPUTING, 2019, : 304 - 306
  • [7] A K-FAC Algorithm Based on the Sherman-Morrison Formula
    Liu, Xiaolei
    Gao, Kaixin
    Wang, Yong
    [J]. Computer Systems & Applications, 2021, 30 (04) : 118 - 124
  • [8] Randomized K-FACs: Speeding Up K-FAC with Randomized Numerical Linear Algebra
    Puiu, Constantin Octavian
    [J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2022, 2022, 13756 : 411 - 422
  • [9] Accelerating distributed deep neural network training with pipelined MPI allreduce
    Castelló, Adrián
    Quintana-Ortí, Enrique S.
    Duato, José
    [J]. CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2021, 24 (04) : 3797 - 3813