Gradient methods never overfit on separable data

Cited by: 0
Authors
Shamir, Ohad [1]
Affiliations
[1] Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
Funding
European Research Council
Keywords
Optimization; Stochastic systems; Large dataset
DOI
Not available
Abstract
A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for T iterations on a dataset of size m, both the empirical risk and the generalization error decrease at an essentially optimal rate of Õ(1/(γ²T)) up till T ≈ m, at which point the generalization error remains fixed at an essentially optimal level of Õ(1/(γ²m)) regardless of how large T is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness. © 2021 Ohad Shamir.
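To make the setting in the abstract concrete, the following is a minimal, self-contained sketch (not taken from the paper): plain gradient descent on the average logistic loss, an exponentially-tailed loss, over a synthetic linearly separable dataset. The dataset, step size, and iteration counts are illustrative assumptions; the printout tracks the smallest normalized margin yᵢ⟨w/‖w‖, xᵢ⟩ and the number of margin violations, which should improve as the iterate's direction approaches the max-margin predictor.

```python
import numpy as np

# Synthetic linearly separable data: labels in {-1, +1} with a positive margin
# with respect to a planted predictor w_star (illustrative construction).
rng = np.random.default_rng(0)
n, d = 200, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)
X += 0.5 * y[:, None] * w_star / np.linalg.norm(w_star)  # enforce a margin

def logistic_loss_grad(w, X, y):
    """Gradient of the average logistic loss (an exponentially-tailed loss)."""
    margins = y * (X @ w)
    # sigmoid(-margins) computed stably via logaddexp: 1/(1+exp(m)) = exp(-log(1+exp(m)))
    coeff = -y * np.exp(-np.logaddexp(0.0, margins))
    return (coeff[:, None] * X).mean(axis=0)

w = np.zeros(d)
eta = 0.5  # step size (hypothetical choice, below 2/smoothness for this data)
for t in range(1, 10001):
    w -= eta * logistic_loss_grad(w, X, y)
    if t in (10, 100, 1000, 10000):
        direction = w / np.linalg.norm(w)
        margins = y * (X @ direction)
        print(f"t={t:6d}  min normalized margin={margins.min():.4f}  "
              f"margin violations={(margins <= 0).sum()}")
```

As the iteration count grows, the minimum normalized margin increases toward the max-margin value γ for this dataset and the violation count drops to zero, consistent with the non-asymptotic picture described in the abstract.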