On the diffusion approximation of nonconvex stochastic gradient descent

Cited by: 31
Authors
Hu, Wenqing [1 ]
Li, Chris Junchi [2 ]
Li, Lei [3 ]
Liu, Jian-Guo [3 ,4 ]
Affiliations
[1] Missouri Univ Sci & Technol, Dept Math & Stat, Rolla, MO USA
[2] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[3] Duke Univ, Dept Math, Durham, NC 27708 USA
[4] Duke Univ, Dept Phys, Durham, NC 27708 USA
Keywords
Nonconvex optimization; stochastic gradient descent; diffusion approximation; stationary points; batch size; EIGENVALUE; OPERATORS; BEHAVIOR;
DOI
10.4310/AMSA.2019.v4.n1.a1
Chinese Library Classification (CLC)
O1 [Mathematics]
Discipline Classification Codes
0701; 070101
Abstract
We study the stochastic gradient descent (SGD) method for nonconvex optimization problems from the viewpoint of approximating diffusion processes. Using the weak form of the master equation for the probability evolution, we prove rigorously that a diffusion process approximates the SGD iteration in the weak sense. In the small-step-size regime and in the presence of omnidirectional noise, this weak approximating diffusion suggests the following dynamics for an SGD iteration started from a local minimizer (resp. saddle point): it escapes in a number of iterations that depends exponentially (resp. almost linearly) on the inverse step size. The results are obtained from the theory of random perturbations of dynamical systems: large-deviation theory for local minimizers and exit theory for unstable stationary points. In addition, we discuss the effect of batch size for deep neural networks and find that a small batch size helps SGD escape unstable stationary points and sharp minimizers. Our theory indicates that using a small batch size at an earlier stage and increasing it at a later stage helps SGD become trapped in flat minimizers, which generalize better.
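The SGD-to-diffusion correspondence described in the abstract can be illustrated with a small numerical sketch. The Python snippet below is not from the paper: the toy double-well objective f(x) = (x^2 - 1)^2 / 4, the constant noise level sigma, and the 1/sqrt(batch_size) noise scaling are illustrative assumptions. It compares SGD iterates against an Euler-Maruyama discretization of the commonly used weak approximating SDE dX_t = -f'(X_t) dt + sqrt(eta) * sigma_eff dW_t, where eta is the step size.

```python
# A minimal sketch, not from the paper: SGD on a toy double-well objective
# f(x) = (x^2 - 1)^2 / 4 versus an Euler-Maruyama discretization of the
# commonly used weak diffusion approximation
#     dX_t = -f'(X_t) dt + sqrt(eta) * sigma_eff dW_t,
# where eta is the step size and sigma_eff = sigma / sqrt(batch_size) models
# how a larger batch shrinks the gradient-noise level (an assumption here).
import numpy as np

rng = np.random.default_rng(0)


def grad_f(x):
    """Gradient of the double-well objective f(x) = (x^2 - 1)^2 / 4."""
    return x ** 3 - x


def sgd_path(x0, eta, sigma, batch_size, n_steps):
    """SGD iterates x_{k+1} = x_k - eta * (f'(x_k) + noise), noise ~ N(0, sigma^2 / B)."""
    sigma_eff = sigma / np.sqrt(batch_size)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        noisy_grad = grad_f(x[k]) + sigma_eff * rng.standard_normal()
        x[k + 1] = x[k] - eta * noisy_grad
    return x


def sde_path(x0, eta, sigma, batch_size, n_steps):
    """Euler-Maruyama path of dX = -f'(X) dt + sqrt(eta) * sigma_eff dW with dt = eta."""
    sigma_eff = sigma / np.sqrt(batch_size)
    dt = eta
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal()
        x[k + 1] = x[k] - grad_f(x[k]) * dt + np.sqrt(eta) * sigma_eff * dW
    return x


if __name__ == "__main__":
    eta, sigma, n_steps = 0.01, 1.0, 20_000
    # Start near the saddle point x = 0; the minimizers sit at x = -1 and x = 1.
    for batch_size in (1, 64):
        sgd = sgd_path(0.01, eta, sigma, batch_size, n_steps)
        sde = sde_path(0.01, eta, sigma, batch_size, n_steps)
        print(f"batch={batch_size:3d}  final SGD iterate={sgd[-1]: .3f}  "
              f"final SDE sample={sde[-1]: .3f}")
```

Under these assumptions, the single-sample (batch = 1) run typically leaves the neighborhood of the saddle quickly and hops between the two wells far more often than the batch-64 run, which is consistent with the abstract's observation that small batches help SGD escape unstable stationary points and sharp minimizers.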
Pages: 3-32
Number of pages: 30
Related Papers
50 records in total
  • [41] Bayesian Distributed Stochastic Gradient Descent
    Teng, Michael
    Wood, Frank
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [42] Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization
    Ermol'ev, Yu. M.
    Norkin, V. I.
    [J]. Cybernetics and Systems Analysis, 1998, 34 : 196 - 215
  • [43] On the discrepancy principle for stochastic gradient descent
    Jahn, Tim
    Jin, Bangti
    [J]. INVERSE PROBLEMS, 2020, 36 (09)
  • [44] Nonparametric Budgeted Stochastic Gradient Descent
    Trung Le
    Vu Nguyen
    Tu Dinh Nguyen
    Dinh Phung
    [J]. ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 51, 2016, 51 : 564 - 572
  • [45] Benign Underfitting of Stochastic Gradient Descent
    Koren, Tomer
    Livni, Roi
    Mansour, Yishay
    Sherman, Uri
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022
  • [46] ON THE GLOBAL CONVERGENCE OF RANDOMIZED COORDINATE GRADIENT DESCENT FOR NONCONVEX OPTIMIZATION
    Chen, Ziang
    Li, Yingzhou
    Lu, Jianfeng
    [J]. SIAM JOURNAL ON OPTIMIZATION, 2023, 33 (02) : 713 - 738
  • [47] The effective noise of stochastic gradient descent
    Mignacco, Francesca
    Urbani, Pierfrancesco
    [J]. JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2022, 2022 (08)
  • [48] On the regularizing property of stochastic gradient descent
    Jin, Bangti
    Lu, Xiliang
    [J]. INVERSE PROBLEMS, 2019, 35 (01)
  • [49] Conjugate directions for stochastic gradient descent
    Schraudolph, NN
    Graepel, T
    [J]. ARTIFICIAL NEURAL NETWORKS - ICANN 2002, 2002, 2415 : 1351 - 1356
  • [50] A stochastic multiple gradient descent algorithm
    Mercier, Quentin
    Poirion, Fabrice
    Desideri, Jean-Antoine
    [J]. EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2018, 271 (03) : 808 - 817