Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima

Cited by: 0
Authors
Swenson, Brian [1 ]
Murray, Ryan [2 ]
Poor, H. Vincent [3 ]
Kar, Soummya [4 ]
Affiliations
[1] Penn State Univ, Appl Res Lab, State Coll, PA 16801 USA
[2] North Carolina State Univ, Dept Math, Raleigh, NC 27695 USA
[3] Princeton Univ, Dept Elect & Comp Engn, Princeton, NJ 08544 USA
[4] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Funding
U.S. National Science Foundation
Keywords
Nonconvex optimization; distributed optimization; stochastic optimization; saddle point; gradient descent; OPTIMIZATION; ALGORITHM; ADAPTATION; CONVEX; NETWORKS
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
Gradient-descent (GD) based algorithms are an indispensable tool for optimizing modern machine learning models. This paper considers distributed stochastic GD (D-SGD), a network-based variant of GD in which agents cooperate over a communication graph. Distributed algorithms play an important role in large-scale machine learning as well as in the Internet of Things (IoT) and related applications. The paper addresses two main issues. First, we study convergence of D-SGD to critical points when the loss function is nonconvex and nonsmooth. We consider a broad class of nonsmooth loss functions, including those of practical interest in modern deep learning. It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Second, we consider the problem of avoiding saddle points. It is well known that classical GD avoids saddle points; however, analogous results have been absent for distributed variants of GD. For this problem, we again allow the loss functions to be nonconvex and nonsmooth, requiring only that they be smooth in a neighborhood of each saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation.
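To make the algorithm concrete, below is a minimal, illustrative sketch of a D-SGD-style update: each agent mixes its iterate with its neighbors' iterates (consensus) and takes a noisy local gradient step with decaying step sizes. This is a generic consensus-plus-gradient scheme of the kind analyzed in this literature, not the authors' exact formulation; the ring graph, the quartic local losses, the step-size exponents, and the helper local_grad are all assumptions chosen for illustration.

```python
import numpy as np

# Illustrative D-SGD-style update (generic consensus + stochastic gradient
# scheme; losses, graph, and step sizes below are assumptions, not the
# paper's exact setup):
#   x_i(t+1) = x_i(t) - beta_t * sum_{j in N(i)} (x_i(t) - x_j(t))
#              - alpha_t * (grad f_i(x_i(t)) + noise)

rng = np.random.default_rng(0)
n_agents, dim, T = 5, 2, 5000

# Hypothetical nonconvex local losses: f_i(x) = ||x||^4 / 4 - b_i . x
b = rng.normal(size=(n_agents, dim))

def local_grad(i, x):
    return np.dot(x, x) * x - b[i]   # grad of ||x||^4 / 4 is ||x||^2 * x

# Ring communication graph: agent i talks to i-1 and i+1 (mod n_agents)
neighbors = [((i - 1) % n_agents, (i + 1) % n_agents) for i in range(n_agents)]

x = rng.normal(size=(n_agents, dim))       # independent random initializations
for t in range(1, T + 1):
    alpha = 0.5 / t                        # gradient step size (non-summable)
    beta = 0.3 / t**0.51                   # consensus weight, decays more slowly
    x_new = np.empty_like(x)
    for i in range(n_agents):
        consensus = sum(x[i] - x[j] for j in neighbors[i])
        noise = rng.normal(scale=0.1, size=dim)   # stochastic gradient noise
        x_new[i] = x[i] - beta * consensus - alpha * (local_grad(i, x[i]) + noise)
    x = x_new

print(x)   # agents' iterates should (roughly) agree near a critical point
```

Roughly speaking, the consensus term forces the agents to agree while the gradient term drives the agreed-upon iterate along the gradient flow of the average of the local losses; the ODE method then transfers properties of this flow, such as convergence to critical points and saddle-point avoidance, to the discrete stochastic iterates.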
Pages: 62