First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Cited by: 0
Authors
Nguyen, Thanh Huy [1]
Simsekli, Umut [1,2]
Gurbuzbalaban, Mert [3]
Richard, Gael [1]
Affiliations
[1] Telecom Paris, Inst Polytech Paris, LTCI, Paris, France
[2] Univ Oxford, Dept Stat, Oxford, England
[3] Rutgers Business Sch, Dept Management Sci & Informat Syst, New Brunswick, NJ USA
Keywords
SDEs driven; differential equations; Lévy; uniqueness
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings exhibits non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by alpha-stable distributions, a family of heavy-tailed distributions that arise in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used to illuminate the behavior of SGD, especially in terms of 'preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might behave significantly differently from its continuous-time limit. Intuitively, the two systems are expected to behave similarly only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding of how small the step-size must be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide a formal theoretical analysis in which we derive explicit conditions on the step-size such that the metastability behavior of the discrete-time system is similar to that of its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes, and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.
Pages: 11
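The abstract views SGD with heavy-tailed gradient noise as a discretization of a Lévy-driven SDE and studies its first exit time from a neighborhood of a minimum. The Python sketch below is purely illustrative and is not the paper's code: the quadratic objective f(x) = x^2/2, the noise scale sigma, the exit radius, and all parameter values are assumptions chosen for demonstration. It samples symmetric alpha-stable noise via the Chambers-Mallows-Stuck method and records the first exit time of the discretized dynamics x_{k+1} = x_k - eta * f'(x_k) + sigma * eta**(1/alpha) * S_k.

```python
import numpy as np

def symmetric_alpha_stable(alpha, size, rng):
    """Symmetric alpha-stable samples (beta = 0, unit scale) via the
    Chambers-Mallows-Stuck method; alpha in (0, 2], alpha = 2 is Gaussian."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def first_exit_time(eta, alpha=1.8, sigma=0.1, radius=1.0,
                    max_iter=500_000, seed=0):
    """First exit time (in continuous-time units, k * eta) of the recursion
        x_{k+1} = x_k - eta * f'(x_k) + sigma * eta**(1/alpha) * S_k
    from [-radius, radius], for the assumed basin f(x) = x**2 / 2,
    started at the minimum x = 0."""
    rng = np.random.default_rng(seed)
    x = 0.0
    for k in range(max_iter):
        s = symmetric_alpha_stable(alpha, 1, rng)[0]
        x = x - eta * x + sigma * eta ** (1.0 / alpha) * s  # f'(x) = x
        if abs(x) > radius:
            return (k + 1) * eta
    return float("inf")  # no exit within the simulation horizon

if __name__ == "__main__":
    # As eta shrinks, the exit behavior of the discrete system should
    # stabilize toward that of the continuous-time Levy-driven SDE.
    for eta in (0.5, 0.1, 0.02):
        times = [first_exit_time(eta, seed=s) for s in range(100)]
        print(f"eta={eta:4.2f}  mean first exit time ~ {np.mean(times):.2f}")
```

Under these assumptions, sweeping eta over decreasing values should show the mean first exit time stabilizing, consistent with the paper's claim that, for sufficiently small step-sizes, the discretized system inherits the metastability behavior of its continuous-time limit.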