Deep Neural Network Training With Distributed K-FAC

Cited by: 0
Authors
Pauloski, J. Gregory [1 ]
Huang, Lei [2 ]
Xu, Weijia [2 ]
Chard, Kyle [1 ]
Foster, Ian T. [1 ]
Zhang, Zhao [2 ]
Affiliations
[1] Univ Chicago, Dept Comp Sci, Chicago, IL 60637 USA
[2] Texas Adv Comp Ctr, Austin, TX 78758 USA
Keywords
Training; Parallel processing; Program processors; Convergence; Computational modeling; Data models; Deep learning; Optimization methods; Neural networks; Scalability; High-performance computing; Optimization
DOI
10.1109/TPDS.2022.3161187
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scales have enabled natural gradient optimization methods as a reasonable alternative to stochastic gradient descent and variants thereof. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, preconditions gradients with an efficient approximation of the Fisher Information Matrix to improve per-iteration progress when optimizing an objective function. Here we propose a scalable K-FAC algorithm and investigate K-FAC's applicability in large-scale deep neural network training. Specifically, we explore layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling, with the goal of preserving convergence while minimizing training time. We evaluate the convergence and scaling properties of our K-FAC gradient preconditioner, for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9-25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
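To make the preconditioning idea concrete, the sketch below is a minimal, single-layer, single-process NumPy illustration of Kronecker-factored gradient preconditioning as summarized in the abstract. It is a sketch under stated assumptions, not the paper's distributed, inverse-free implementation; the function name kfac_precondition, the damping value, and the array shapes are chosen purely for illustration.

```python
import numpy as np

# Minimal single-layer sketch of K-FAC preconditioning (illustrative only; the
# paper's distributed, inverse-free variant is not reproduced here).
# For a fully connected layer with weights W (out x in), K-FAC approximates the
# layer's Fisher block as A ⊗ G, where
#   A = E[a a^T]  -- covariance of the layer's inputs a
#   G = E[g g^T]  -- covariance of the gradients g w.r.t. the layer's outputs,
# so preconditioning reduces to G^{-1} dW A^{-1} instead of working with the
# full Fisher matrix.

def kfac_precondition(dW, a_batch, g_batch, damping=1e-3):
    """Hypothetical helper: precondition the gradient dW of one linear layer.

    dW      : (out, in)  gradient of the loss w.r.t. the weights
    a_batch : (batch, in)  layer inputs
    g_batch : (batch, out) gradients of the loss w.r.t. the layer outputs
    damping : Tikhonov damping added to both Kronecker factors
    """
    n = a_batch.shape[0]
    A = a_batch.T @ a_batch / n + damping * np.eye(a_batch.shape[1])
    G = g_batch.T @ g_batch / n + damping * np.eye(g_batch.shape[1])
    # (A ⊗ G)^{-1} vec(dW) == vec(G^{-1} dW A^{-1}); only the small per-layer
    # factors are ever formed or inverted.
    return np.linalg.solve(G, dW) @ np.linalg.inv(A)

# Toy usage with random data standing in for a real forward/backward pass.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 128))    # batch of 32 inputs to a 128 -> 64 layer
g = rng.standard_normal((32, 64))     # matching output gradients
dW = g.T @ a / 32                     # plain first-order gradient
dW_nat = kfac_precondition(dW, a, g)  # K-FAC-preconditioned gradient
```

How the per-layer factor work is placed on workers, and whether explicit inversion is avoided altogether, is exactly the design space the abstract describes (layer-wise distribution, inverse-free second-order evaluation, and decoupled K-FAC updates).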
Pages: 3616-3627
Page count: 12
Related Papers
50 records in total
  • [1] Convolutional Neural Network Training with Distributed K-FAC
    Pauloski, J. Gregory
    Zhang, Zhao
    Huang, Lei
    Xu, Weijia
    Foster, Ian T.
    [J]. PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20), 2020,
  • [2] Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning
    Zhang, Lin
    Shi, Shaohuai
    Wang, Wei
    Li, Bo
    [J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2023, 11 (03) : 2365 - 2378
  • [3] Inefficiency of K-FAC for Large Batch Size Training
    Ma, Linjian
    Montague, Gabe
    Ye, Jiayu
    Yao, Zhewei
    Gholami, Amir
    Keutzer, Kurt
    Mahoney, Michael W.
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 5053 - 5060
  • [4] Optimizing Q-Learning with K-FAC Algorithm
    Beltiukov, Roman
    [J]. ANALYSIS OF IMAGES, SOCIAL NETWORKS AND TEXTS (AIST 2019), 2020, 1086 : 3 - 8
  • [5] Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks
    Shi, Shaohuai
    Zhang, Lin
    Li, Bo
    [J]. 2021 IEEE 41ST INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2021), 2021, : 550 - 560
  • [6] Distributed Deep Neural Network Training on Edge Devices
    Benditkis, Daniel
    Keren, Aviv
    Mor-Yosef, Liron
    Avidor, Tomer
    Shoham, Neta
    Tal-Israel, Nadav
    [J]. SEC'19: PROCEEDINGS OF THE 4TH ACM/IEEE SYMPOSIUM ON EDGE COMPUTING, 2019, : 304 - 306
  • [7] A K-FAC Algorithm Based on the Sherman-Morrison Formula
    Liu, Xiaolei
    Gao, Kaixin
    Wang, Yong
    [J]. Computer Systems & Applications, 2021, 30 (04) : 118 - 124
  • [8] Randomized K-FACs: Speeding Up K-FAC with Randomized Numerical Linear Algebra
    Puiu, Constantin Octavian
    [J]. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2022, 2022, 13756 : 411 - 422
  • [9] Accelerating distributed deep neural network training with pipelined MPI allreduce
    Castelló, Adrián
    Quintana-Ortí, Enrique S.
    Duato, José
    [J]. CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2021, 24 (04) : 3797 - 3813