Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima

Cited: 0
Authors
Swenson, Brian [1 ]
Murray, Ryan [2 ]
Poor, H. Vincent [3 ]
Kar, Soummya [4 ]
Affiliations
[1] Penn State Univ, Appl Res Lab, State Coll, PA 16801 USA
[2] North Carolina State Univ, Dept Math, Raleigh, NC 27695 USA
[3] Princeton Univ, Dept Elect & Comp Engn, Princeton, NJ 08544 USA
[4] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Funding
U.S. National Science Foundation;
Keywords
Nonconvex optimization; distributed optimization; stochastic optimization; saddle point; gradient descent; OPTIMIZATION; ALGORITHM; ADAPTATION; CONVEX; NETWORKS;
DOI
Not available
CLC classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Gradient-descent (GD)-based algorithms are an indispensable tool for optimizing modern machine learning models. This paper considers distributed stochastic gradient descent (D-SGD), a network-based variant of GD. Distributed algorithms play an important role in large-scale machine learning problems as well as in the Internet of Things (IoT) and related applications. The paper considers two main issues. First, we study convergence of D-SGD to critical points when the loss function is nonconvex and nonsmooth. We consider a broad range of nonsmooth loss functions, including those of practical interest in modern deep learning. It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points. It is well known that classical GD avoids saddle points; however, analogous results have been absent for distributed variants of GD. For this problem, we again allow loss functions to be nonconvex and nonsmooth, but assume they are smooth in a neighborhood of a saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation.
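For orientation, a minimal sketch of the update commonly analyzed under the name D-SGD is given below: each agent i mixes its iterate with those of its network neighbors and takes a local noisy gradient step. The step-size sequences \alpha_k and \beta_k, the neighbor sets \Omega_i, the local losses f_i, and the noise terms \xi_i(k) are illustrative notation for this sketch and are not taken from the paper; in the nonsmooth case, \nabla f_i should be read as a (sub)gradient selection.

    x_i(k+1) = x_i(k) - \beta_k \sum_{j \in \Omega_i} \big( x_i(k) - x_j(k) \big) - \alpha_k \big( \nabla f_i(x_i(k)) + \xi_i(k) \big), \qquad i = 1, \ldots, n.

Under standard stochastic-approximation step-size conditions (for example, \sum_k \alpha_k = \infty and \sum_k \alpha_k^2 < \infty), the ODE method mentioned in the abstract relates the interpolated iterates to the underlying distributed gradient flow, which is the route the abstract indicates for establishing convergence to critical points and saddle-point avoidance.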
Pages: 62
Related papers
50 records in total
  • [31] Gao, Hongchang; Wang, Xiaoqian; Luo, Lei; Shi, Xinghua. On the Convergence of Stochastic Compositional Gradient Descent Ascent Method. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI 2021), 2021: 2389-2395.
  • [32] Zhou, Bai-cun; Han, Cong-ying; Guo, Tian-de. Convergence of Stochastic Gradient Descent in Deep Neural Network. Acta Mathematicae Applicatae Sinica, 2021, 37(1): 126-136.
  • [33] Schraudolph, N. N. Local Gain Adaptation in Stochastic Gradient Descent. Ninth International Conference on Artificial Neural Networks (ICANN 99), Vols. 1-2, 1999, (470): 569-574.
  • [34] Brierton, J. L. Techniques for Avoiding Local Minima in Gradient Descent Based ID Algorithms. Radar Sensor Technology II, 1997, 3066: 130-135.
  • [35] Ming, Yuewei; Zhao, Yawei; Wu, Chengkun; Li, Kuan; Yin, Jianping. Distributed and Asynchronous Stochastic Gradient Descent with Variance Reduction. Neurocomputing, 2018, 281: 27-36.
  • [36] Phuong, Tran Thi; Phong, Le Trieu; Fukushima, Kazuhide. Distributed Stochastic Gradient Descent With Compressed and Skipped Communication. IEEE Access, 2023, 11: 99836-99846.
  • [37] Horii, Shunsuke; Yoshida, Takahiro; Kobayashi, Manabu; Matsushima, Toshiyasu. Distributed Stochastic Gradient Descent Using LDGM Codes. 2019 IEEE International Symposium on Information Theory (ISIT), 2019: 1417-1421.
  • [38] Li, Weiyu; Wu, Zhaoxian; Chen, Tianyi; Li, Liping; Ling, Qing. Communication-Censored Distributed Stochastic Gradient Descent. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(11): 6831-6843.
  • [39] Li, Haochuan; Farnia, Farzan; Das, Subhro; Jadbabaie, Ali. On Convergence of Gradient Descent Ascent: A Tight Local Analysis. International Conference on Machine Learning, Vol. 162, 2022.
  • [40] Goldstein, A. A.; Price, J. F. On Descent from Local Minima. Mathematics of Computation, 1971, 25(115): 569-574.