AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for neural networks

Cited by: 5

Authors:

Sun, Hao [1]

Shen, Li [2]

Zhong, Qihuang [3]

Ding, Liang [2]

Chen, Shixiang [4]

Sun, Jingwei [1]

Li, Jing [1]

Sun, Guangzhong [1]

Tao, Dacheng [5]

Affiliations:

[1] Univ Sci & Technol China, Sch Comp Sci, Hefei 230026, Anhui, Peoples R China

[2] JD com, Beijing, Peoples R China

[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China

[4] Univ Sci & Technol China, Sch Math Sci, Hefei 230026, Anhui, Peoples R China

[5] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia

Source:

NEURAL NETWORKS | 2024 / Vol. 169

Keywords:

Sharpness-aware minimization; Adaptive learning rate; Non-convex optimization; Momentum acceleration; Linear speedup; CONVERGENCE;

DOI:

10.1016/j.neunet.2023.10.044

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The sharpness-aware minimization (SAM) optimizer has been extensively explored because it generalizes better when training deep neural networks: it introduces extra perturbation steps that flatten the loss landscape of deep learning models. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without theoretical guarantees, owing to the triple difficulty of analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves the linear speedup property with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient steps from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term that makes them independent when taking expectations in the analysis. We then bound them by showing that the adaptive learning rate has a limited range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, which show that AdaSAM can achieve superior performance compared with the SGD, AMSGrad, and SAM optimizers.
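
The three coupled pieces named in the abstract (perturbation step, adaptive learning rate, momentum step) can be illustrated with a minimal sketch. This is not the authors' reference implementation: it assumes an AMSGrad-style running maximum for the second-moment estimate, and the names `adasam_step`, `init_state`, `quadratic_grad`, and `run`, together with all hyperparameter defaults (`lr`, `rho`, `beta1`, `beta2`, `eps`), are hypothetical choices made for illustration only.

```python
import math


def init_state(dim):
    """Zero-initialised momentum and second-moment buffers (illustrative)."""
    return {"m": [0.0] * dim, "v": [0.0] * dim, "v_hat": [0.0] * dim}


def adasam_step(x, grad_fn, state, lr=0.1, rho=0.05,
                beta1=0.9, beta2=0.99, eps=1e-8):
    """One AdaSAM-style update on a list of floats (assumed form, see lead-in)."""
    # 1) SAM perturbation step: move a distance rho along the normalised gradient.
    g = grad_fn(x)
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    x_perturbed = [xi + rho * gi / (g_norm + eps) for xi, gi in zip(x, g)]

    # 2) Gradient evaluated at the perturbed point.
    s = grad_fn(x_perturbed)

    new_x = []
    for i, si in enumerate(s):
        # 3) Momentum step: exponential moving average of perturbed gradients.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * si
        # 4) Adaptive learning rate: second moment with an AMSGrad-style
        #    running maximum (an assumption of this sketch).
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * si * si
        state["v_hat"][i] = max(state["v_hat"][i], state["v"][i])
        # 5) Coordinate-wise scaled update.
        new_x.append(x[i] - lr * state["m"][i]
                     / (math.sqrt(state["v_hat"][i]) + eps))
    return new_x


def quadratic_grad(x):
    """Gradient of the toy objective f(x) = sum(x_i^2)."""
    return [2.0 * xi for xi in x]


def run(steps=200):
    """Run the sketch on the toy quadratic from a fixed starting point."""
    x = [3.0, -2.0]
    state = init_state(len(x))
    for _ in range(steps):
        x = adasam_step(x, quadratic_grad, state)
    return x
```

Calling `run()` applies the update repeatedly to a toy quadratic. In a stochastic setting, `grad_fn` would instead return a mini-batch gradient, with the same mini-batch used for both gradient evaluations within a step.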

Pages: 506-519

Page count: 14
