The Heavy-Tail Phenomenon in SGD

Cited: 0
Authors
Gurbuzbalaban, Mert [1 ]
Simsekli, Umut [2 ]
Zhu, Lingjiong [3 ]
Affiliations
[1] Rutgers Business Sch, Dept Management Sci & Informat Syst, Piscataway, NJ 08854 USA
[2] PSL Res Univ, Ecole Normale Super, INRIA, Dept Informat, Paris, France
[3] Florida State Univ, Dept Math, Tallahassee, FL 32306 USA
Funding
National Science Foundation (USA);
Keywords
Random difference equations; Power-law distributions; Stationary solutions
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize η to the batch-size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives on generalization are deeply linked to each other. We claim that, depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters η and b, the distribution of the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all orders, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to the algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data and on fully connected and convolutional neural networks.
Pages: 12
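
As a rough illustration of the quadratic-optimization claim in the abstract (this sketch is not from the paper; every function name and parameter value in it is an assumption chosen for illustration), the Python snippet below runs many independent SGD chains on a one-dimensional linear regression with i.i.d. Gaussian data and applies a Hill-type estimator to the final iterates. Under the claim above, a larger stepsize-to-batch-size ratio η/b should produce heavier tails, i.e., a smaller estimated tail index.

# Hypothetical sketch (not from the paper): SGD on 1-D linear regression with
# i.i.d. Gaussian data; the tail index of the final iterates is estimated with
# a Hill-type estimator. All parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(eta, batch_size, n_runs=2000, n_steps=1000, a_true=1.0, sigma=1.0):
    """Final SGD iterate of n_runs independent chains for the scalar model
    y = a_true * x + noise, with quadratic loss (1/2) * (a * x - y)**2."""
    a = np.zeros(n_runs)                                  # one parameter per chain
    for _ in range(n_steps):
        x = rng.normal(size=(n_runs, batch_size))
        y = a_true * x + sigma * rng.normal(size=(n_runs, batch_size))
        grad = np.mean((a[:, None] * x - y) * x, axis=1)  # minibatch gradient
        a = a - eta * grad
    return a

def hill_tail_index(samples, k=100):
    """Crude Hill estimator of the tail index from the k largest |samples|."""
    s = np.sort(np.abs(samples))[::-1][:k]                # top-k order statistics
    return 1.0 / np.mean(np.log(s[:-1] / s[-1]))

# Larger stepsize / smaller batch should give a smaller (heavier-tailed) index.
for eta, b in [(0.1, 10), (0.5, 5), (0.9, 1)]:
    a_final = run_sgd(eta, b)
    print(f"eta={eta}, batch={b}: estimated tail index ~ {hill_tail_index(a_final):.2f}")

In this scalar setting the SGD update reduces to a random affine recursion, a_{t+1} = (1 - eta * mean(x^2)) * a_t + eta * mean(x * y), which is the random-difference-equation mechanism behind the heavy-tailed stationary distributions discussed in the abstract.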