Gradient methods never overfit on separable data

Cited by: 0
Authors
Shamir, Ohad [1]
Affiliations
[1] Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
Funding
European Research Council
Keywords
Optimization; Stochastic systems; Large dataset
DOI
Not available
Abstract
A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for T iterations on a dataset of size m, both the empirical risk and the generalization error decrease at an essentially optimal rate of Õ(1/(γ²T)) up till T ≈ m, at which point the generalization error remains fixed at an essentially optimal level of Õ(1/(γ²m)) regardless of how large T is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness. © 2021 Ohad Shamir.
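To make the setting in the abstract concrete, the following is a minimal, self-contained sketch (not taken from the paper): plain gradient descent on the average logistic loss, an exponentially-tailed loss, over a synthetic linearly separable dataset. The dataset, step size, and iteration counts are illustrative assumptions; the printout tracks the smallest normalized margin yᵢ⟨w/‖w‖, xᵢ⟩ and the number of margin violations, which should improve as the iterate's direction approaches the max-margin predictor.

```python
import numpy as np

# Synthetic linearly separable data: labels in {-1, +1} with a positive margin
# with respect to a planted predictor w_star (illustrative construction).
rng = np.random.default_rng(0)
n, d = 200, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)
X += 0.5 * y[:, None] * w_star / np.linalg.norm(w_star)  # enforce a margin

def logistic_loss_grad(w, X, y):
    """Gradient of the average logistic loss (an exponentially-tailed loss)."""
    margins = y * (X @ w)
    # sigmoid(-margins) computed stably via logaddexp: 1/(1+exp(m)) = exp(-log(1+exp(m)))
    coeff = -y * np.exp(-np.logaddexp(0.0, margins))
    return (coeff[:, None] * X).mean(axis=0)

w = np.zeros(d)
eta = 0.5  # step size (hypothetical choice, below 2/smoothness for this data)
for t in range(1, 10001):
    w -= eta * logistic_loss_grad(w, X, y)
    if t in (10, 100, 1000, 10000):
        direction = w / np.linalg.norm(w)
        margins = y * (X @ direction)
        print(f"t={t:6d}  min normalized margin={margins.min():.4f}  "
              f"margin violations={(margins <= 0).sum()}")
```

As the iteration count grows, the minimum normalized margin increases toward the max-margin value γ for this dataset and the violation count drops to zero, consistent with the non-asymptotic picture described in the abstract.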