A random matrix theory approach to damping in deep learning

Cited by: 1
Authors
Granziol, Diego [1 ]
Baskerville, Nicholas [2 ]
Affiliations
[1] Huawei AI Theory, London, England
[2] Univ Bristol, Bristol, Avon, England
Source
JOURNAL OF PHYSICS-COMPLEXITY | 2022, Vol. 3, Issue 02
Keywords
deep learning; adaptive gradient optimisation; random matrices; generalisation; nonlinear shrinkage; largest eigenvalue; covariance; estimator
DOI
10.1088/2632-072X/ac730c
Chinese Library Classification
O1 [Mathematics]
Subject Classification
0701; 070101
Abstract
We conjecture that the inherent difference in generalisation between adaptive and non-adaptive gradient methods in deep learning stems from the increased estimation noise in the flattest directions of the true loss surface. We demonstrate that typical schedules used for adaptive methods (with low numerical stability or damping constants) bias movement towards flat directions relative to sharp ones, effectively amplifying the noise-to-signal ratio and harming generalisation. We further demonstrate that the numerical damping constant used in these methods can be decomposed into a learning rate reduction and a linear shrinkage of the estimated curvature matrix. We then demonstrate significant generalisation improvements from increasing the shrinkage coefficient, closing the generalisation gap entirely in both logistic regression and several deep neural network experiments. Extending this line of work, we develop a novel random matrix theory based damping learner for second order optimisers, inspired by linear shrinkage estimation. We experimentally demonstrate that our learner is very insensitive to its initialised value and allows for extremely fast convergence together with continued stable training and competitive generalisation. We also find that our derived method works well with adaptive gradient methods such as Adam.
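A minimal numerical sketch of the decomposition claimed in the abstract, assuming the standard damped second-order update -alpha * (H + delta*I)^{-1} g; the symbols alpha, delta, H, g and the toy data below are illustrative choices, not taken from the paper. The identity H + delta*I = (1 + delta) * ((1 - rho)*H + rho*I) with rho = delta / (1 + delta) splits the damping constant delta into a learning rate reduction by 1/(1 + delta) and a linear shrinkage of the curvature estimate with coefficient rho:

import numpy as np

rng = np.random.default_rng(0)

n = 5
A = rng.standard_normal((n, n))
H = A @ A.T / n             # toy symmetric PSD "curvature" estimate
g = rng.standard_normal(n)  # toy gradient

alpha, delta = 0.1, 1e-2    # learning rate and damping constant

# Standard damped second-order step: -alpha * (H + delta*I)^{-1} g
step_damped = -alpha * np.linalg.solve(H + delta * np.eye(n), g)

# Equivalent form: reduced learning rate applied against a linearly
# shrunk curvature matrix, with shrinkage coefficient rho.
rho = delta / (1.0 + delta)
H_shrunk = (1.0 - rho) * H + rho * np.eye(n)
step_shrunk = -(alpha / (1.0 + delta)) * np.linalg.solve(H_shrunk, g)

print(np.allclose(step_damped, step_shrunk))  # True

Under this reading, increasing the damping constant simultaneously lowers the effective learning rate and increases the shrinkage coefficient, which is what lets the paper tune the two effects separately.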
Pages: 23
Related Papers
50 records in total
  • [1] Pennington, Jeffrey; Worah, Pratik. Nonlinear random matrix theory for deep learning. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017, 30.
  • [2] Pennington, Jeffrey; Worah, Pratik. Nonlinear random matrix theory for deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019, 2019 (12).
  • [3] Baskerville, Nicholas P.; Granziol, Diego; Keating, Jonathan P. Appearance of Random Matrix Theory in deep learning. Physica A: Statistical Mechanics and its Applications, 2022, 590.
  • [4] Tao, Terence. Dynamical approach to random matrix theory. Bulletin of the American Mathematical Society, 2020, 57 (01): 161-169.
  • [5] Liao, Zhenyu; Couillet, Romain. The Dynamics of Learning: A Random Matrix Approach. International Conference on Machine Learning, Vol. 80, 2018, 80.
  • [6] Xie, Rongrong; Marsili, Matteo. A random energy approach to deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2022, 2022 (07).
  • [7] Elkhalil, Khalil; Kammoun, Abla; Al-Naffouri, Tareq Y.; Alouini, Mohamed-Slim. Measurement Selection: A Random Matrix Theory Approach. IEEE Transactions on Wireless Communications, 2018, 17 (07): 4899-4911.
  • [9] Ge, Jungang; Liang, Ying-Chang; Bai, Zhidong; Pan, Guangming. Large-dimensional random matrix theory and its applications in deep learning and wireless communications. Random Matrices: Theory and Applications, 2021, 10 (04).
  • [10] Granziol, Diego; Zohren, Stefan; Roberts, Stephen. Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training. Journal of Machine Learning Research, 2022, 23.