Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications

Cited by: 0
Authors
Karandikar, Rajeeva Laxman [1 ]
Vidyasagar, Mathukumalli [2 ]
Affiliations
[1] Chennai Math Inst, Chennai, India
[2] Indian Inst Technol Hyderabad, Hyderabad, India
Keywords
Stochastic gradient descent; Stochastic approximation; Nonconvex optimization; Martingale methods;
DOI: 10.1007/s10957-024-02547-7
Chinese Library Classification (CLC)
C93 [Management]; O22 [Operations Research]
Subject classification codes
070105; 12; 1201; 1202; 120202
Abstract
In this paper, we study the convergence properties of the Stochastic Gradient Descent (SGD) method for finding a stationary point of a given objective function J(·). The objective function is not required to be convex. Rather, our results apply to a class of "invex" functions, which have the property that every stationary point is also a global minimizer. First, it is assumed that J(·) satisfies a property that is slightly weaker than the Kurdyka-Łojasiewicz (KL) condition, denoted here as (KL'). It is shown that the iterations J(θ_t) converge almost surely to the global minimum of J(·).
Next, the hypothesis on J(·) is strengthened from (KL') to the Polyak-Łojasiewicz (PL) condition. With this stronger hypothesis, we derive estimates on the rate of convergence of J(θ_t) to its limit. Using these results, we show that for functions satisfying the PL property, the convergence rate of both the objective function and the norm of the gradient with SGD is the same as the best-possible rate for convex functions. While some results along these lines have been published in the past, our contributions contain two distinct improvements. First, the assumptions on the stochastic gradient are more general than elsewhere, and second, our convergence is almost sure, and not in expectation. We also study SGD when only function evaluations are permitted. In this setting, we determine the "optimal" increments, or the size of the perturbations. Using the same set of ideas, we establish the global convergence of the Stochastic Approximation (SA) algorithm under more general assumptions on the measurement error than in the existing literature. We also derive bounds on the rate of convergence of the SA algorithm under appropriate assumptions.
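The PL condition states that, for some μ > 0, ‖∇J(θ)‖² ≥ 2μ(J(θ) − J*) for all θ, where J* is the global minimum; this is what allows a gradient method to converge globally without convexity. As a rough illustration of the setting (not taken from the paper), the sketch below runs SGD with Robbins-Monro step sizes on f(x) = x² + 3 sin²(x), a standard nonconvex function that satisfies the PL condition with global minimum 0 at x = 0. The additive-Gaussian noise model and the specific step-size schedule are illustrative assumptions, not the paper's exact hypotheses.

```python
import numpy as np

def f(x):
    # Nonconvex objective: x^2 + 3*sin(x)^2 is a textbook example of a
    # PL function; its only stationary point is the global minimizer x = 0.
    return x**2 + 3.0 * np.sin(x)**2

def grad(x):
    # d/dx [x^2 + 3 sin^2 x] = 2x + 3 sin(2x)
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def sgd(x0, steps=20000, noise_std=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    for t in range(steps):
        # Robbins-Monro schedule: sum of alpha_t diverges,
        # sum of alpha_t^2 converges (exponent in (0.5, 1]).
        alpha = 0.1 / (1.0 + t)**0.6
        g = grad(x) + noise_std * rng.standard_normal()  # noisy gradient oracle
        x -= alpha * g
    return x

x_final = sgd(x0=2.0)
print(f(x_final))  # close to the global minimum 0
```

The same skeleton extends to the function-evaluation-only setting discussed in the abstract by replacing `grad` with a finite-difference estimate built from perturbed evaluations of `f`, at the cost of an extra bias term controlled by the perturbation size.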
Pages: 2412-2450 (39 pages)
Related papers (50 records)
  • [1] Noise conditions for prespecified convergence rates of stochastic approximation algorithms
    Chong, EKP
    Wang, IJ
    Kulkarni, SR
    IEEE TRANSACTIONS ON INFORMATION THEORY, 1999, 45 (02) : 810 - 814
  • [2] Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
    Wang, Hongjian
    Gurbuzbalaban, Mert
    Zhu, Lingjiong
    Simsekli, Umut
    Erdogdu, Murat A.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [3] Convergence and applications of stochastic approximation with state-dependent noise
    Chen, HF
    PROCEEDINGS OF THE 2001 AMERICAN CONTROL CONFERENCE, VOLS 1-6, 2001, : 744 - 749
  • [4] On conditions for convergence rates of stochastic approximation algorithms
    Chong, EKP
    Wang, IJ
    Kulkarni, SR
    PROCEEDINGS OF THE 36TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-5, 1997, : 2279 - 2280
  • [7] Stochastic approximation algorithms: Nonasymptotic estimation of their convergence rates
    Kul'chitskii, OY
    Mozgovoi, AE
    AUTOMATION AND REMOTE CONTROL, 1997, 58 (11) : 1817 - 1823
  • [8] CONVERGENCE RATES AND DECOUPLING IN LINEAR STOCHASTIC APPROXIMATION ALGORITHMS
    Kouritzin, Michael A.
    Sadeghi, Samira
    SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2015, 53 (03) : 1484 - 1508
  • [9] Sharp convergence rates of stochastic approximation for degenerate roots
    Fang, HT
    Chen, HF
    SCIENCE IN CHINA SERIES E-TECHNOLOGICAL SCIENCES, 1998, 41 (04) : 383 - 392
  • [10] RATES OF CONVERGENCE FOR STOCHASTIC-APPROXIMATION TYPE ALGORITHMS
    KUSHNER, HJ
    HUANG, H
    SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 1979, 17 (05) : 607 - 617