Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration

Cited: 34
Authors
Mei, Song [1 ]
Misiakiewicz, Theodor [2 ]
Montanari, Andrea [2 ,3 ]
Affiliations
[1] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[2] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
Keywords
Random features; Kernel methods; Generalization error; High dimensional limit; Inequalities
DOI
10.1016/j.acha.2021.12.003
CLC number
O29 [Applied Mathematics]
Subject classification
070104
Abstract
Consider the classical supervised learning problem: we are given data (y_i, x_i), i ≤ n, with y_i a response and x_i ∈ X a covariate vector, and we try to learn a model f : X → R to predict future responses. Random features methods map the covariate vector x_i to a point φ(x_i) in a higher-dimensional space R^N via a random featurization map φ. We study random features methods in conjunction with ridge regression in the feature space R^N. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: (1) What is the generalization error of KRR? (2) How big must N be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top ℓ eigenfunctions of the kernel, where ℓ depends on the sample size n. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as N ≤ n^{1-δ} for some δ > 0. We characterize this gap. For N ≥ n^{1+δ}, random features achieve the same error as the corresponding KRR, and further increasing N does not lead to a significant change in test error. © 2021 Elsevier Inc. All rights reserved.
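The setup described in the abstract lends itself to a quick numerical illustration. Below is a minimal sketch (not the paper's experiments) of random-features ridge regression as a finite-N approximation of KRR; the ReLU feature map, Gaussian data, target function, regularization scaling, and all dimensions are illustrative assumptions.

```python
# Minimal sketch: random-features ridge regression as a finite-N
# approximation of kernel ridge regression. All modeling choices below
# (ReLU features, Gaussian data, sin target, lambda scaling) are
# illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, lam = 20, 500, 500, 1e-3

# Training and test covariates, with a simple nonlinear target.
X = rng.standard_normal((n, d)) / np.sqrt(d)
X_test = rng.standard_normal((n_test, d)) / np.sqrt(d)
f = lambda x: np.sin(x @ np.ones(d))
y, y_test = f(X), f(X_test)

def rf_ridge_error(N):
    """Test MSE of ridge regression on N random ReLU features."""
    W = rng.standard_normal((d, N))
    Phi, Phi_test = np.maximum(X @ W, 0), np.maximum(X_test @ W, 0)
    # Ridge solution in feature space; penalty scaled by N so that the
    # implicit feature normalization 1/sqrt(N) is absorbed.
    a = np.linalg.solve(Phi.T @ Phi + lam * N * np.eye(N), Phi.T @ y)
    return np.mean((Phi_test @ a - y_test) ** 2)

# As N grows past n, the random-features error should plateau near the
# error of KRR with the associated expected kernel.
for N in [50, 500, 5000]:
    print(f"N = {N:5d}: test MSE = {rf_ridge_error(N):.4f}")
```

Under this sketch, the test error decreases with N and roughly plateaus once N exceeds n, mirroring the N ≥ n^{1+δ} regime described in the abstract.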
Pages: 3-84
Page count: 82
Related papers
50 items in total
  • [21] Kernel conjugate gradient methods with random projections
    Lin, Junhong
    Cevher, Volkan
    APPLIED AND COMPUTATIONAL HARMONIC ANALYSIS, 2021, 55 : 223 - 269
  • [22] Accurate Probabilistic Error Bound for Eigenvalues of Kernel Matrix
    Jia, Lei
    Liao, Shizhong
    ADVANCES IN MACHINE LEARNING, PROCEEDINGS, 2009, 5828 : 162 - 175
  • [23] Parallel approximate matrix factorization for kernel methods
    Zhu, Kaihua
    Cui, Hang
    Bai, Hongjie
    Li, Jian
    Qiu, Zhihuan
    Wang, Hao
    Xu, Hui
    Chang, Edward Y.
    2007 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-5, 2007, : 1275 - 1278
  • [24] The Slow Deterioration of the Generalization Error of the Random Feature Model
    Ma, Chao
    Wu, Lei
E, Weinan
    MATHEMATICAL AND SCIENTIFIC MACHINE LEARNING, VOL 107, 2020, 107 : 373 - +
  • [25] Fast generalization error bound of deep learning from a kernel perspective
    Suzuki, Taiji
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018, 84
  • [26] A sparsity driven kernel machine based on minimizing a generalization error bound
    Peleg, Dori
    Meir, Ron
    PATTERN RECOGNITION, 2009, 42 (11) : 2607 - 2614
  • [27] THE TOP EIGENVALUE OF THE RANDOM TOEPLITZ MATRIX AND THE SINE KERNEL
    Sen, Arnab
    Virag, Balint
    ANNALS OF PROBABILITY, 2013, 41 (06): : 4050 - 4079
  • [28] Multi Kernel Fuzzy Clustering With Unsupervised Random Forests Kernel and Matrix-Induced Regularization
    Zhao, Yin-Ping
    Chen, Long
    Gan, Min
    Chen, C. L. Philip
    IEEE ACCESS, 2019, 7 : 3967 - 3979
  • [29] Sparsity Based Feature Extraction for Kernel Minimum Squared Error
    Jiang, Jiang
    Chen, Xi
    Gan, Haitao
    Sang, Nong
    PATTERN RECOGNITION (CCPR 2014), PT I, 2014, 483 : 273 - 282
  • [30] ERROR ANALYSIS FOR A CLASS OF DEGENERATE-KERNEL METHODS
Sloan, I. H.
    NUMERISCHE MATHEMATIK, 1976, 25 (03) : 231 - 238