Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration

Cited by: 34
Authors
Mei, Song [1 ]
Misiakiewicz, Theodor [2 ]
Montanari, Andrea [2 ,3 ]
Affiliations
[1] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[2] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
Keywords
Random features; Kernel methods; Generalization error; High dimensional limit; Inequalities
DOI
10.1016/j.acha.2021.12.003
CLC classification
O29 [Applied Mathematics]
Discipline code
070104
Abstract
Consider the classical supervised learning problem: we are given data $(y_i, x_i)$, $i \le n$, with $y_i$ a response and $x_i \in \mathcal{X}$ a covariate vector, and we seek to learn a model $f : \mathcal{X} \to \mathbb{R}$ to predict future responses. Random features methods map the covariate vector $x_i$ to a point $\phi(x_i)$ in a higher-dimensional space $\mathbb{R}^N$ via a random featurization map $\phi$. We study the use of random features methods in conjunction with ridge regression in the feature space $\mathbb{R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: (1) What is the generalization error of KRR? (2) How big should $N$ be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $\ell$ eigenfunctions of the kernel, where $\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error, and is larger than the error of KRR as long as $N \le n^{1-\delta}$ for some $\delta > 0$. We characterize this gap. For $N \ge n^{1+\delta}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.
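As a concrete illustration of the two methods compared in the abstract, here is a minimal sketch, assuming a Gaussian kernel, random Fourier features, and a synthetic single-index target (all illustrative choices of ours, not the paper's setting). It fits kernel ridge regression and its random features approximation, contrasting a small feature count $N \ll n$ with a large one $N \gg n$:

```python
# Minimal sketch: KRR vs. random features ridge regression.
# Illustrative assumptions: Gaussian kernel k(x,x') = exp(-||x-x'||^2),
# random Fourier features, synthetic target. Not the paper's experiments.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 300, 10, 1e-3                    # sample size, dimension, ridge penalty

def sphere(m):
    """Draw m covariate vectors uniformly on the unit sphere in R^d."""
    Z = rng.standard_normal((m, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

w_star = rng.standard_normal(d)              # hypothetical target direction
X, Xt = sphere(n), sphere(1000)              # train / test covariates
y = np.sin(X @ w_star) + 0.1 * rng.standard_normal(n)
yt = np.sin(Xt @ w_star)

def gauss_kernel(A, B):
    """k(x, x') = exp(-||x - x'||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq)

# Kernel ridge regression: f(x) = k(x, X) @ alpha.
alpha = np.linalg.solve(gauss_kernel(X, X) + lam * n * np.eye(n), y)
mse_krr = np.mean((gauss_kernel(Xt, X) @ alpha - yt) ** 2)
print(f"KRR test MSE: {mse_krr:.4f}")

# Random features approximation: E[phi(x) . phi(x')] = k(x, x').
for N in (n // 10, 10 * n):                  # N << n versus N >> n
    W = rng.standard_normal((d, N)) * np.sqrt(2.0)   # frequencies for this kernel
    b = rng.uniform(0.0, 2 * np.pi, N)
    phi = lambda A: np.sqrt(2.0 / N) * np.cos(A @ W + b)
    Z = phi(X)                               # n x N feature matrix
    theta = np.linalg.solve(Z.T @ Z + lam * n * np.eye(N), Z.T @ y)
    mse_rf = np.mean((phi(Xt) @ theta - yt) ** 2)
    print(f"RF test MSE (N = {N:5d}): {mse_rf:.4f}")
```

With these arbitrary choices, one typically sees the random features error exceed the KRR error in the $N \ll n$ run and essentially match it in the $N \gg n$ run, which is the qualitative behavior the abstract quantifies via the thresholds $n^{1 \pm \delta}$.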
Pages: 3-84 (82 pages)
Related Papers (10 of 50 shown)
  • [1] Generalization Error Bounds for Kernel Matrix Completion and Extrapolation
    Gimenez-Febrer, Pere
    Pages-Zamora, Alba
    Giannakis, Georgios B.
    IEEE SIGNAL PROCESSING LETTERS, 2020, 27 : 326 - 330
  • [2] On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA
    Shawe-Taylor, J
    Williams, CKI
    Cristianini, N
    Kandola, J
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2005, 51 (07) : 2510 - 2522
  • [3] Generalization error analysis for polynomial kernel methods - Algebraic geometrical approach
    Ikeda, K
    ARTIFICIAL NEURAL NETWORKS AND NEURAL INFORMATION PROCESSING - ICANN/ICONIP 2003, 2003, 2714 : 201 - 208
  • [4] Analyses on Generalization Error of Ensemble Kernel Regressors
    Tanaka, Akira
    Takigawa, Ichigaku
    Imai, Hideyuki
    Kudo, Mineichi
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, 2014, 8621 : 273 - 281
  • [5] Random Forests and Kernel Methods
    Scornet, Erwan
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2016, 62 (03) : 1485 - 1500
  • [6] Random Feature Maps for the Itemset Kernel
    Atarashi, Kyohei
    Maji, Subhransu
    Oyama, Satoshi
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 3199 - 3206
  • [7] Generalization Error of Minimum Weighted Norm and Kernel Interpolation
    Li, Weilin
    SIAM JOURNAL ON MATHEMATICS OF DATA SCIENCE, 2021, 3 (01) : 414 - 438
  • [8] Distributionally Robust Optimization and Generalization in Kernel Methods
    Staib, Matthew
    Jegelka, Stefanie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [9] ON THE OPTIMAL ERROR OF DEGENERATE KERNEL METHODS
    HEINRICH, S
    JOURNAL OF INTEGRAL EQUATIONS, 1985, 9 (03) : 251 - 266
  • [10] Kernel methods for heterogeneous feature selection
    Paul, Jerome
    D'Ambrosio, Roberto
    Dupont, Pierre
    NEUROCOMPUTING, 2015, 169 : 187 - 195