Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima

Cited: 0
Authors
Swenson, Brian [1 ]
Murray, Ryan [2 ]
Poor, H. Vincent [3 ]
Kar, Soummya [4 ]
Affiliations
[1] Penn State Univ, Appl Res Lab, State Coll, PA 16801 USA
[2] North Carolina State Univ, Dept Math, Raleigh, NC 27695 USA
[3] Princeton Univ, Dept Elect & Comp Engn, Princeton, NJ 08544 USA
[4] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Funding
U.S. National Science Foundation;
Keywords
Nonconvex optimization; distributed optimization; stochastic optimization; saddle point; gradient descent; OPTIMIZATION; ALGORITHM; ADAPTATION; CONVEX; NETWORKS;
DOI
Not available
CLC classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Gradient-descent (GD)-based algorithms are an indispensable tool for optimizing modern machine learning models. This paper considers distributed stochastic gradient descent (D-SGD), a network-based variant of GD. Distributed algorithms play an important role in large-scale machine learning problems as well as in the Internet of Things (IoT) and related applications. The paper considers two main issues. First, we study convergence of D-SGD to critical points when the loss function is nonconvex and nonsmooth. We consider a broad range of nonsmooth loss functions, including those of practical interest in modern deep learning. It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points. It is well known that classical GD avoids saddle points; however, analogous results have been absent for distributed variants of GD. For this problem, we again allow loss functions to be nonconvex and nonsmooth, but assume they are smooth in a neighborhood of a saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation.
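For orientation, a minimal sketch of the update commonly analyzed under the name D-SGD is given below: each agent i mixes its iterate with those of its network neighbors and takes a local noisy gradient step. The step-size sequences \alpha_k and \beta_k, the neighbor sets \Omega_i, the local losses f_i, and the noise terms \xi_i(k) are illustrative notation for this sketch and are not taken from the paper; in the nonsmooth case, \nabla f_i should be read as a (sub)gradient selection.

    x_i(k+1) = x_i(k) - \beta_k \sum_{j \in \Omega_i} \big( x_i(k) - x_j(k) \big) - \alpha_k \big( \nabla f_i(x_i(k)) + \xi_i(k) \big), \qquad i = 1, \ldots, n.

Under standard stochastic-approximation step-size conditions (for example, \sum_k \alpha_k = \infty and \sum_k \alpha_k^2 < \infty), the ODE method mentioned in the abstract relates the interpolated iterates to the underlying distributed gradient flow, which is the route the abstract indicates for establishing convergence to critical points and saddle-point avoidance.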
Pages: 62
Related papers
50 records in total
  • [31] Gao, Hongchang; Wang, Xiaoqian; Luo, Lei; Shi, Xinghua. On the Convergence of Stochastic Compositional Gradient Descent Ascent Method. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI 2021), 2021: 2389-2395.
  • [32] Zhou, Bai-cun; Han, Cong-ying; Guo, Tian-de. Convergence of Stochastic Gradient Descent in Deep Neural Network. Acta Mathematicae Applicatae Sinica, 2021, 37(1): 126-136.
  • [33] Schraudolph, N. N. Local Gain Adaptation in Stochastic Gradient Descent. Ninth International Conference on Artificial Neural Networks (ICANN 99), Vols. 1-2, 1999, (470): 569-574.
  • [34] Brierton, J. L. Techniques for Avoiding Local Minima in Gradient Descent Based ID Algorithms. Radar Sensor Technology II, 1997, 3066: 130-135.
  • [35] Ming, Yuewei; Zhao, Yawei; Wu, Chengkun; Li, Kuan; Yin, Jianping. Distributed and Asynchronous Stochastic Gradient Descent with Variance Reduction. Neurocomputing, 2018, 281: 27-36.
  • [36] Phuong, Tran Thi; Phong, Le Trieu; Fukushima, Kazuhide. Distributed Stochastic Gradient Descent With Compressed and Skipped Communication. IEEE Access, 2023, 11: 99836-99846.
  • [37] Horii, Shunsuke; Yoshida, Takahiro; Kobayashi, Manabu; Matsushima, Toshiyasu. Distributed Stochastic Gradient Descent Using LDGM Codes. 2019 IEEE International Symposium on Information Theory (ISIT), 2019: 1417-1421.
  • [38] Li, Weiyu; Wu, Zhaoxian; Chen, Tianyi; Li, Liping; Ling, Qing. Communication-Censored Distributed Stochastic Gradient Descent. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(11): 6831-6843.
  • [39] Li, Haochuan; Farnia, Farzan; Das, Subhro; Jadbabaie, Ali. On Convergence of Gradient Descent Ascent: A Tight Local Analysis. International Conference on Machine Learning, Vol. 162, 2022.
  • [40] Goldstein, A. A.; Price, J. F. On Descent from Local Minima. Mathematics of Computation, 1971, 25(115): 569-574.