On the diffusion approximation of nonconvex stochastic gradient descent

Cited by: 31
Authors
Hu, Wenqing [1]
Li, Chris Junchi [2]
Li, Lei [3]
Liu, Jian-Guo [3,4]
Affiliations
[1] Missouri Univ Sci & Technol, Dept Math & Stat, Rolla, MO USA
[2] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[3] Duke Univ, Dept Math, Durham, NC 27708 USA
[4] Duke Univ, Dept Phys, Durham, NC 27708 USA
Keywords
Nonconvex optimization; stochastic gradient descent; diffusion approximation; stationary points; batch size
DOI
10.4310/AMSA.2019.v4.n1.a1
Chinese Library Classification
O1 [Mathematics]
Subject classification code
0701; 070101
Abstract
We study the stochastic gradient descent (SGD) method for nonconvex optimization problems from the point of view of approximating diffusion processes. Using the weak form of the master equation for the probability evolution, we prove rigorously that the diffusion process approximates the SGD algorithm in the weak sense. In the small-step-size regime and in the presence of omnidirectional noise, the weakly approximating diffusion process suggests the following dynamics for an SGD iteration started from a local minimizer (resp. saddle point): it escapes in a number of iterations that depends exponentially (resp. almost linearly) on the inverse step size. These results are obtained using the theory of random perturbations of dynamical systems (large-deviation theory for local minimizers and exit-time theory for unstable stationary points). In addition, we discuss the effect of batch size for deep neural networks and find that a small batch size helps SGD escape unstable stationary points and sharp minimizers. Our theory indicates that using a small batch size at an earlier stage and increasing the batch size at a later stage helps SGD become trapped in flat minimizers, which generalize better.
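The following is a minimal, illustrative Python sketch (not code or an experiment from the paper) of the batch-size schedule the abstract suggests. It runs SGD on an assumed toy 1-D nonconvex objective with a sharp, shallow minimizer and a flat, deep one, modeling mini-batch gradient noise as Gaussian with standard deviation scaling like 1/sqrt(batch size). The objective, noise model, step size, and schedule parameters are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of the toy objective
    #   f(x) = 0.01 x^2 - 0.3 exp(-(x+1)^2 / 0.04) - exp(-(x-2)^2 / 3),
    # i.e. a weak quadratic confinement, a sharp shallow well at x = -1,
    # and a flat deep well at x = +2.
    sharp = 0.3 * (x + 1.0) / 0.02 * np.exp(-(x + 1.0) ** 2 / 0.04)
    flat = (x - 2.0) / 1.5 * np.exp(-(x - 2.0) ** 2 / 3.0)
    return 0.02 * x + sharp + flat

def sgd(x0, eta=0.05, sigma=3.0, n_iters=20_000, switch=10_000,
        small_batch=2, large_batch=256):
    # Mini-batch gradient noise is modeled as Gaussian with standard
    # deviation sigma / sqrt(batch size): small batches give noisier steps.
    x = x0
    for k in range(n_iters):
        batch = small_batch if k < switch else large_batch
        noisy_grad = grad_f(x) + sigma / np.sqrt(batch) * rng.standard_normal()
        x -= eta * noisy_grad
    return x

# Start at the sharp minimizer; with the small-then-large batch schedule the
# iterate typically escapes it during the noisy early phase and settles near
# the flat minimizer at x = 2 once the batch size increases.
print(sgd(x0=-1.0))
```

In this sketch the early, small-batch phase plays the role of the omnidirectional noise in the abstract (helping the iterate leave sharp minimizers and unstable stationary points), while the later, large-batch phase reduces the noise so the iterate stays trapped in the flat minimizer.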
Pages: 3-32 (30 pages)