Projective Fisher Information for Natural Gradient Descent

Cited by: 1
Authors:
Kaul, Piyush [1]
Lall, Brejesh [1]
Affiliation:
[1] Indian Institute of Technology Delhi, Department of Electrical Engineering, New Delhi, 110016, India
Source: IEEE Transactions on Artificial Intelligence
Keywords:
Complex networks; Covariance matrix; Deep neural networks; Fisher information matrix; Gradient methods; Learning algorithms
DOI:
10.1109/TAI.2022.3153593
Abstract:
Improvements in neural network optimization algorithms have enabled shorter training times and state-of-the-art performance on various machine learning tasks. Fisher-information-based natural gradient descent is one such second-order method: it improves both the convergence speed and the final performance metric of many machine learning algorithms. Fisher information matrices are also helpful for analyzing the properties and expected behavior of neural networks. However, natural gradient descent is a high-complexity method because covariance matrices must be maintained and inverted. This is especially the case for modern deep neural networks, whose very large parameter counts often make the problem computationally infeasible. We suggest using the Fisher information to analyze the parameter space of fully connected and convolutional neural networks without calculating the matrix itself. We also propose a lower-complexity natural gradient descent algorithm based on projecting the Kronecker factors of the Fisher information, combined with recursive calculation of their inverses, which is both computationally cheaper and more stable. Finally, we share analysis and results showing that these optimizations do not affect accuracy while considerably lowering the complexity of the optimization process. These improvements should enable natural gradient descent methods to be applied to neural networks with more parameters than previously possible. © 2020 IEEE.
Pages: 304–314
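
The abstract outlines a Kronecker-factored natural gradient scheme, but this record contains no equations or code. As a reading aid, here is a minimal sketch of the generic KFAC-style update such methods build on, for a single fully connected layer. All variable names and shapes (a, g, A, G, damping) are illustrative assumptions, and the paper's specific contributions (projecting the Kronecker factors and computing their inverses recursively) are not reproduced here.

```python
import numpy as np

# Minimal KFAC-style natural gradient step for one fully connected layer.
# A generic sketch, NOT the paper's exact algorithm (the record gives only
# the abstract); all names, shapes, and constants are assumptions.

rng = np.random.default_rng(0)
n_in, n_out, batch = 64, 32, 128
W = rng.normal(scale=0.1, size=(n_out, n_in))  # layer weights

# Quantities ordinary backprop already produces on a mini-batch:
# a = layer inputs (activations), g = pre-activation loss gradients.
a = rng.normal(size=(batch, n_in))
g = rng.normal(size=(batch, n_out))

# Euclidean mini-batch gradient of the loss w.r.t. W.
grad_W = g.T @ a / batch

# Kronecker factors of this layer's Fisher block, F ~= A (x) G:
# A = E[a a^T] is n_in x n_in, G = E[g g^T] is n_out x n_out. Keeping and
# inverting these two small factors replaces the full (n_in*n_out)-dimensional
# block, which is what makes natural gradient tractable at all.
damping = 1e-3
A = a.T @ a / batch + damping * np.eye(n_in)
G = g.T @ g / batch + damping * np.eye(n_out)

# Kronecker identity: (A (x) G)^-1 vec(grad_W) = vec(G^-1 grad_W A^-1),
# so F itself is never materialized or inverted.
nat_grad = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

W -= 0.1 * nat_grad  # natural gradient descent step
```

In a full optimizer, A and G would be tracked as running averages across steps, and the recursive inverse calculation the abstract mentions would replace the explicit solve/inv calls above; those details are not recoverable from this record.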