Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

Cited by: 0
Authors
Zhang, Guodong [1 ,2 ]
Martens, James [3 ]
Grosse, Roger [1 ,2 ]
Affiliations
[1] Univ Toronto, Toronto, ON, Canada
[2] Vector Inst, Toronto, ON, Canada
[3] DeepMind, London, England
Keywords
ERROR;
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions, and we give a bound on the rate of this convergence.
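As a rough illustration of the update the abstract describes (a hypothetical sketch, not code from the paper), the NumPy snippet below applies one damped natural-gradient step to a two-layer ReLU network with squared-error loss. For squared error the Fisher/Gauss-Newton matrix is F = J^T J / n, and in the overparameterized regime the pseudoinverse update reduces to theta <- theta - lr * J^T (J J^T)^{-1} r, where the n x n Gram matrix G = J J^T is invertible exactly when the Jacobian has full row rank (condition (1) above). The helper names (forward, jacobian, ngd_step), the initialization, and the damping term are illustrative assumptions rather than details taken from the paper.

import numpy as np

def forward(W, v, X):
    """Two-layer ReLU network f(x) = v^T relu(W x); returns predictions and pre-activations."""
    Z = X @ W.T                                   # (n, m) hidden pre-activations
    return np.maximum(Z, 0.0) @ v, Z              # predictions (n,), pre-activations (n, m)

def jacobian(W, v, X):
    """Jacobian of the n network outputs w.r.t. all parameters (W flattened row-major, then v)."""
    n, d = X.shape
    m = W.shape[0]
    _, Z = forward(W, v, X)
    mask = (Z > 0).astype(X.dtype)                # ReLU derivative, (n, m)
    JW = (mask * v)[:, :, None] * X[:, None, :]   # d f_i / d W_kj = v_k * 1[z_ik > 0] * x_ij
    Jv = np.maximum(Z, 0.0)                       # d f_i / d v_k = relu(z_ik)
    return np.concatenate([JW.reshape(n, m * d), Jv], axis=1)   # (n, m*d + m)

def ngd_step(W, v, X, y, lr=1.0, damping=1e-3):
    """One natural-gradient step. Since F = J^T J / n is singular when p >> n, the
    pseudoinverse update works through the n x n Gram matrix G = J J^T; the small
    damping term is only for numerical stability and is not part of the idealized analysis."""
    n = X.shape[0]
    preds, _ = forward(W, v, X)
    residual = preds - y
    J = jacobian(W, v, X)                         # (n, p) with p = m*d + m >> n
    G = J @ J.T                                   # full row rank of J  <=>  G invertible
    step = J.T @ np.linalg.solve(G + damping * np.eye(n), residual)
    theta = np.concatenate([W.ravel(), v]) - lr * step
    m, d = W.shape
    return theta[:m * d].reshape(m, d), theta[m * d:]

# Tiny usage example on random data (sizes chosen arbitrarily; m*d + m >> n, i.e. overparameterized).
rng = np.random.default_rng(0)
n, d, m = 8, 3, 256
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # nondegenerate, unit-norm inputs
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
W, v = ngd_step(W, v, X, y)

With lr = 1 and no damping, this single step zeroes the residual of the model linearized at the current parameters, which is the intuition behind the fast convergence rate proved in the paper.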
Pages: 12
Related Papers (50 records)
  • [31] Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
    Oymak, Samet
    Soltanolkotabi, Mahdi
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [32] A Fast Adaptive Online Gradient Descent Algorithm in Over-Parameterized Neural Networks
    Yang, Anni
    Li, Dequan
    Li, Guangxiang
    [J]. NEURAL PROCESSING LETTERS, 2023, 55 (04) : 4641 - 4659
  • [34] Applying Gradient Descent in Convolutional Neural Networks
    Cui, Nan
    [J]. 2ND INTERNATIONAL CONFERENCE ON MACHINE VISION AND INFORMATION TECHNOLOGY (CMVIT 2018), 2018, 1004
  • [35] Fast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit
    Mitra, Partha P.
    [J]. 2018 26TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2018, : 1890 - 1894
  • [36] Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks
    Shamir, Ohad
    [J]. CONFERENCE ON LEARNING THEORY, VOL 99, 2019, 99
  • [37] Natural Gradient Descent of Complex-Valued Neural Networks Invariant under Rotations
    Mukuno, Jun-ichi
    Matsui, Hajime
    [J]. IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2019, E102A (12) : 1988 - 1996
  • [38] Global Convergence of Gradient Descent for Deep Linear Residual Networks
    Wu, Lei
    Wang, Qingcan
    Ma, Chao
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [39] Study on fast speed fractional order gradient descent method and its application in neural networks
    Wang, Yong
    He, Yuli
    Zhu, Zhiguang
    [J]. NEUROCOMPUTING, 2022, 489 : 366 - 376
  • [40] Overparametrized Multi-layer Neural Networks: Uniform Concentration of Neural Tangent Kernel and Convergence of Stochastic Gradient Descent
    Xu, Jiaming
    Zhu, Hanjing
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2024, 25 : 1 - 83