Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration

Cited: 34
Authors
Mei, Song [1 ]
Misiakiewicz, Theodor [2 ]
Montanari, Andrea [2 ,3 ]
Affiliations
[1] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[2] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
Keywords
Random features; Kernel methods; Generalization error; High dimensional limit; Inequalities
DOI
10.1016/j.acha.2021.12.003
CLC number
O29 [Applied Mathematics]
Subject classification
070104
Abstract
Consider the classical supervised learning problem: we are given data (y_i, x_i), i ≤ n, with y_i a response and x_i ∈ X a covariate vector, and we try to learn a model f : X → R to predict future responses. Random features methods map the covariate vector x_i to a point φ(x_i) in a higher-dimensional space R^N via a random featurization map φ. We study random features methods in conjunction with ridge regression in the feature space R^N. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: (1) What is the generalization error of KRR? (2) How big must N be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top ℓ eigenfunctions of the kernel, where ℓ depends on the sample size n. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as N ≤ n^{1-δ} for some δ > 0. We characterize this gap. For N ≥ n^{1+δ}, random features achieve the same error as the corresponding KRR, and further increasing N does not lead to a significant change in test error. © 2021 Elsevier Inc. All rights reserved.
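The setup described in the abstract lends itself to a quick numerical illustration. Below is a minimal sketch (not the paper's experiments) of random-features ridge regression as a finite-N approximation of KRR; the ReLU feature map, Gaussian data, target function, regularization scaling, and all dimensions are illustrative assumptions.

```python
# Minimal sketch: random-features ridge regression as a finite-N
# approximation of kernel ridge regression. All modeling choices below
# (ReLU features, Gaussian data, sin target, lambda scaling) are
# illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, lam = 20, 500, 500, 1e-3

# Training and test covariates, with a simple nonlinear target.
X = rng.standard_normal((n, d)) / np.sqrt(d)
X_test = rng.standard_normal((n_test, d)) / np.sqrt(d)
f = lambda x: np.sin(x @ np.ones(d))
y, y_test = f(X), f(X_test)

def rf_ridge_error(N):
    """Test MSE of ridge regression on N random ReLU features."""
    W = rng.standard_normal((d, N))
    Phi, Phi_test = np.maximum(X @ W, 0), np.maximum(X_test @ W, 0)
    # Ridge solution in feature space; penalty scaled by N so that the
    # implicit feature normalization 1/sqrt(N) is absorbed.
    a = np.linalg.solve(Phi.T @ Phi + lam * N * np.eye(N), Phi.T @ y)
    return np.mean((Phi_test @ a - y_test) ** 2)

# As N grows past n, the random-features error should plateau near the
# error of KRR with the associated expected kernel.
for N in [50, 500, 5000]:
    print(f"N = {N:5d}: test MSE = {rf_ridge_error(N):.4f}")
```

Under this sketch, the test error decreases with N and roughly plateaus once N exceeds n, mirroring the N ≥ n^{1+δ} regime described in the abstract.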
Pages: 3-84
Page count: 82
Related papers
50 items in total
  • [21] Kernel conjugate gradient methods with random projections
    Lin, Junhong
    Cevher, Volkan
    APPLIED AND COMPUTATIONAL HARMONIC ANALYSIS, 2021, 55 : 223 - 269
  • [22] Accurate Probabilistic Error Bound for Eigenvalues of Kernel Matrix
    Jia, Lei
    Liao, Shizhong
    ADVANCES IN MACHINE LEARNING, PROCEEDINGS, 2009, 5828 : 162 - 175
  • [23] Parallel approximate matrix factorization for kernel methods
    Zhu, Kaihua
    Cui, Hang
    Bai, Hongjie
    Li, Jian
    Qiu, Zhihuan
    Wang, Hao
    Xu, Hui
    Chang, Edward Y.
    2007 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-5, 2007, : 1275 - 1278
  • [24] The Slow Deterioration of the Generalization Error of the Random Feature Model
    Ma, Chao
    Wu, Lei
E, Weinan
    MATHEMATICAL AND SCIENTIFIC MACHINE LEARNING, VOL 107, 2020, 107 : 373 - +
  • [25] Fast generalization error bound of deep learning from a kernel perspective
    Suzuki, Taiji
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018, 84
  • [26] A sparsity driven kernel machine based on minimizing a generalization error bound
    Peleg, Dori
    Meir, Ron
    PATTERN RECOGNITION, 2009, 42 (11) : 2607 - 2614
  • [27] THE TOP EIGENVALUE OF THE RANDOM TOEPLITZ MATRIX AND THE SINE KERNEL
    Sen, Arnab
    Virag, Balint
    ANNALS OF PROBABILITY, 2013, 41 (06): : 4050 - 4079
  • [28] Multi Kernel Fuzzy Clustering With Unsupervised Random Forests Kernel and Matrix-Induced Regularization
    Zhao, Yin-Ping
    Chen, Long
    Gan, Min
    Chen, C. L. Philip
    IEEE ACCESS, 2019, 7 : 3967 - 3979
  • [29] Sparsity Based Feature Extraction for Kernel Minimum Squared Error
    Jiang, Jiang
    Chen, Xi
    Gan, Haitao
    Sang, Nong
    PATTERN RECOGNITION (CCPR 2014), PT I, 2014, 483 : 273 - 282
  • [30] ERROR ANALYSIS FOR A CLASS OF DEGENERATE-KERNEL METHODS
Sloan, I. H.
    NUMERISCHE MATHEMATIK, 1976, 25 (03) : 231 - 238