The Heavy-Tail Phenomenon in SGD

Cited by: 0
Authors
Gurbuzbalaban, Mert [1]
Simsekli, Umut [2]
Zhu, Lingjiong [3]
Affiliations
[1] Rutgers Business Sch, Dept Management Sci & Informat Syst, Piscataway, NJ 08854 USA
[2] PSL Res Univ, Ecole Normale Super, INRIA, Dept Informat, Paris, France
[3] Florida State Univ, Dept Math, Tallahassee, FL 32306 USA
Funding
National Science Foundation (USA);
Keywords
RANDOM DIFFERENCE-EQUATIONS; POWER-LAW DISTRIBUTIONS; STATIONARY SOLUTIONS;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize eta to the batch-size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters eta and b, the distribution of the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all orders, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data and on fully connected and convolutional neural networks.
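The linear regression claim in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' code or experimental setup. It assumes batch size b = 1, Gaussian inputs a ~ N(0, sigma^2), and targets y ~ N(0, 1) independent of a, so that one SGD step on 0.5 * (a*x - y)^2 becomes the random recurrence x <- (1 - eta*a^2) * x + eta*a*y. It further assumes the classical Kesten-Goldie characterization for such recurrences, in which the tail index alpha is the positive solution of E|1 - eta*a^2|^alpha = 1; that result is brought in here as an assumption and is not quoted from the abstract. The function names `multiplier_moment`, `tail_index`, and `simulate_sgd` are hypothetical.

```python
import numpy as np


def multiplier_moment(alpha, eta, sigma=1.0, span=20.0, grid=400_001):
    """Trapezoid-rule estimate of E|1 - eta * a**2|**alpha for a ~ N(0, sigma**2)."""
    z = np.linspace(-span, span, grid)                 # grid for a standard normal
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    vals = np.abs(1.0 - eta * (sigma * z) ** 2) ** alpha * pdf
    return float(np.sum(0.5 * (vals[1:] + vals[:-1])) * (z[1] - z[0]))


def tail_index(eta, sigma=1.0, lo=0.05, hi_max=64.0, iters=50):
    """Solve E|M|**alpha = 1 for alpha > 0 by bisection, with M = 1 - eta * a**2.

    Assumption: the moment is below 1 just above alpha = 0 (stationary regime)
    and crosses 1 exactly once, as in the Kesten-Goldie setting.
    """
    if multiplier_moment(lo, eta, sigma) >= 1.0:
        raise ValueError("E log|M| appears non-negative: no stationary regime")
    hi = 2.0 * lo
    while multiplier_moment(hi, eta, sigma) < 1.0:     # grow the upper bracket
        hi *= 2.0
        if hi > hi_max:
            raise ValueError("tail index exceeds hi_max")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if multiplier_moment(mid, eta, sigma) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_sgd(eta, n_chains=2000, n_steps=500, sigma=1.0, seed=0):
    """Run independent SGD chains (batch size 1) on 1-D least squares."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_chains)
    for _ in range(n_steps):
        a = rng.normal(0.0, sigma, n_chains)           # fresh input per step
        y = rng.normal(0.0, 1.0, n_chains)             # fresh target per step
        x = x - eta * a * (a * x - y)                  # gradient of 0.5*(a*x - y)**2
    return x


if __name__ == "__main__":
    for eta in (0.1, 1.0):
        print(f"eta={eta}: estimated tail index = {tail_index(eta):.3f}")
```

Under these assumptions, E(1 - eta*a^2)^2 = 1 - 2*eta + 3*eta^2 for sigma = 1, which exceeds 1 once eta > 2/3. A step size such as eta = 1 therefore gives a tail index below 2 (infinite variance) even though the data are Gaussian, while a small step size such as eta = 0.1 gives a tail index above 2. The output of `simulate_sgd` can be inspected with a histogram or an empirical tail estimator to compare against `tail_index`.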
Pages: 12