Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

Cited by: 0
Authors
Zhang, Guodong [1 ,2 ]
Martens, James [3 ]
Grosse, Roger [1 ,2 ]
Affiliations
[1] Univ Toronto, Toronto, ON, Canada
[2] Vector Inst, Toronto, ON, Canada
[3] DeepMind, London, England
Keywords
ERROR;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's outputs for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable under small perturbations around the initialization. For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions, and we give a bound on the rate of this convergence.
Pages: 12
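
Illustrative sketch. The following is a minimal, hypothetical example (not the authors' implementation) of the update rule described in the abstract: exact, damped natural gradient descent on a two-layer ReLU network with squared-error loss, written in JAX. The architecture (output scale 1/sqrt(m), fixed ±1 output weights with only the first layer trained), the learning rate, the damping constant, and all function names are illustrative assumptions. For squared-error loss the damped Fisher step reduces, via the push-through identity, to theta <- theta - eta * J^T (J J^T + lambda I)^{-1} (u - y), where J is the Jacobian of the outputs with respect to the parameters; condition (1) in the abstract (full row rank of J) is what keeps the Gram matrix J J^T invertible.

# Minimal illustrative sketch (assumptions as stated above; not the authors' code).
import jax
import jax.numpy as jnp

def predict(W, X, a):
    # Network outputs f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x) for all n cases, shape (n,).
    m = W.shape[0]
    return (jax.nn.relu(X @ W.T) @ a) / jnp.sqrt(m)

def ngd_step(W, X, y, a, lr=1.0, damping=1e-4):
    u = predict(W, X, a)
    # Jacobian of the outputs w.r.t. the flattened first-layer weights, shape (n, m*d).
    J = jax.jacobian(lambda w: predict(w.reshape(W.shape), X, a))(W.ravel())
    # Gram matrix G = J J^T; full row rank of J (condition 1) keeps G invertible.
    G = J @ J.T
    # Damped natural gradient step: W <- W - lr * J^T (G + damping*I)^{-1} (u - y).
    step = J.T @ jnp.linalg.solve(G + damping * jnp.eye(G.shape[0]), u - y)
    return W - lr * step.reshape(W.shape)

# Toy usage on random, unit-norm (non-degenerate) inputs.
key = jax.random.PRNGKey(0)
n, d, m = 8, 5, 256
kx, kw, ka, ky = jax.random.split(key, 4)
X = jax.random.normal(kx, (n, d))
X = X / jnp.linalg.norm(X, axis=1, keepdims=True)
y = jax.random.normal(ky, (n,))
W = jax.random.normal(kw, (m, d))
a = jax.random.choice(ka, jnp.array([-1.0, 1.0]), (m,))
for t in range(5):
    W = ngd_step(W, X, y, a)
    print(t, float(0.5 * jnp.sum((predict(W, X, a) - y) ** 2)))

In this overparameterized toy setting (m*d parameters versus n training cases) the squared-error loss printed at each step should drop rapidly, consistent with the fast-convergence behaviour the abstract describes, though this sketch is only meant to make the update rule concrete, not to reproduce the paper's results.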