The Heavy-Tail Phenomenon in SGD

Cited: 0
Authors:
Gurbuzbalaban, Mert [1]
Simsekli, Umut [2]
Zhu, Lingjiong [3]
Affiliations:
[1] Rutgers Business Sch, Dept Management Sci & Informat Syst, Piscataway, NJ 08854 USA
[2] PSL Res Univ, Ecole Normale Super, INRIA, Dept Informat, Paris, France
[3] Florida State Univ, Dept Math, Tallahassee, FL 32306 USA
Funding:
U.S. National Science Foundation
Keywords:
RANDOM DIFFERENCE-EQUATIONS; POWER-LAW DISTRIBUTIONS; STATIONARY SOLUTIONS
DOI: not available
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize η to the batch size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives on generalization are deeply linked. We claim that, depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters η and b, the distribution of the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all orders, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to the algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data, fully connected, and convolutional neural networks.
Pages: 12
Related papers
50 records in total
  • [1] Heavy-Tail Phenomenon in Decentralized SGD
    Gurbuzbalaban, Mert
    Hu, Yuanhan
    Simsekli, Umut
    Yuan, Kun
    Zhu, Lingjiong
    IISE TRANSACTIONS, 2024
  • [2] On the foundations of multivariate heavy-tail analysis
    Resnick, S
    JOURNAL OF APPLIED PROBABILITY, 2004, 41A : 191 - 212
  • [3] On the Heavy-Tail Behavior of the Distributionally Robust Newsvendor
    Das, Bikramjit
    Dhara, Anulekha
    Natarajan, Karthik
    OPERATIONS RESEARCH, 2021, 69 (04) : 1077 - 1099
  • [4] Aggregation of Dependent Risks with Heavy-Tail Distributions
    Guillen, Montserrat
    Sarabia, Jose Maria
    Prieto, Faustino
    Jorda, Vanesa
    INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2019, 27 (Suppl. 1) : 77 - 88
  • [5] Estimating the heavy-tail index for WWW traces
    Ramirez Pacheco, J. C.
    2006 3rd International Conference on Electrical and Electronics Engineering, 2006 : 365 - 368
  • [6] When do heavy-tail distributions help?
    Hansen, Nikolaus
    Gemperle, Fabian
    Auger, Anne
    Koumoutsakos, Petros
    PARALLEL PROBLEM SOLVING FROM NATURE - PPSN IX, PROCEEDINGS, 2006, 4193 : 62 - 71
  • [7] DIRECTED POLYMERS IN HEAVY-TAIL RANDOM ENVIRONMENT
    Berger, Quentin
    Torri, Niccolo
    ANNALS OF PROBABILITY, 2019, 47 (06) : 4024 - 4076
  • [8] SKEW AND HEAVY-TAIL EFFECTS ON FIRM PERFORMANCE
    Makino, Shige
    Chan, Christine M.
    STRATEGIC MANAGEMENT JOURNAL, 2017, 38 (08) : 1721 - 1740
  • [9] Heavy-tail phenomena. Probabilistic and statistical modeling
    Viharos, Laszlo
    ACTA SCIENTIARUM MATHEMATICARUM, 2008, 74 (1-2) : 472 - 473
  • [10] Self-Similarity and Heavy-Tail of ICMP Traffic
    Liu, Wai-xi
    Yan, Yi-er
    Dong, Tang
    Tang, Run-hua
    JOURNAL OF COMPUTERS, 2012, 7 (12) : 2948 - 2954