AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for neural networks

Cited by: 5
|
Authors
Sun, Hao [1 ]
Shen, Li [2 ]
Zhong, Qihuang [3 ]
Ding, Liang [2 ]
Chen, Shixiang [4 ]
Sun, Jingwei [1 ]
Li, Jing [1 ]
Sun, Guangzhong [1 ]
Tao, Dacheng [5 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci, Hefei 230026, Anhui, Peoples R China
[2] JD.com, Beijing, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China
[4] Univ Sci & Technol China, Sch Math Sci, Hefei 230026, Anhui, Peoples R China
[5] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
Keywords
Sharpness-aware minimization; Adaptive learning rate; Non-convex optimization; Momentum acceleration; Linear speedup; CONVERGENCE;
DOI
10.1016/j.neunet.2023.10.044
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The sharpness-aware minimization (SAM) optimizer has been extensively explored because it improves generalization when training deep neural networks by introducing an extra perturbation step that flattens the loss landscape of deep learning models. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without a theoretical guarantee, owing to the triple difficulty of analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves the linear speedup property with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient steps from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term so that they become independent when taking expectations in the analysis. We then bound them by showing that the adaptive learning rate has a bounded range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
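The abstract describes AdaSAM as SAM's ascent perturbation step combined with an Adam-style adaptive learning rate and momentum update. Below is a minimal Python/NumPy sketch of one such step under that description; the function name adasam_step, the toy grad_fn, and all hyperparameter defaults are illustrative assumptions here, not the authors' reference implementation.

import numpy as np

def adasam_step(w, grad_fn, m, v, rho=0.05, lr=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # One illustrative AdaSAM-style update: a SAM perturbation step followed
    # by an Adam-style step (momentum + adaptive learning rate).
    # Sketch only; names and hyperparameters are assumptions.
    g = grad_fn(w)                                  # stochastic gradient at w
    e = rho * g / (np.linalg.norm(g) + 1e-12)       # SAM ascent perturbation
    g_sam = grad_fn(w + e)                          # gradient at perturbed point
    m = beta1 * m + (1 - beta1) * g_sam             # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g_sam ** 2        # second moment (adaptive lr)
    w = w - lr * m / (np.sqrt(v) + eps)             # bias correction omitted
    return w, m, v

# Toy usage: run AdaSAM-style steps on a simple quadratic.
w = np.array([2.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad_fn = lambda x: 2 * x            # gradient of f(x) = ||x||^2
for _ in range(500):
    w, m, v = adasam_step(w, grad_fn, m, v, lr=0.05)
print(w)                             # approaches the minimum at the origin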
Pages: 506-519
Number of pages: 14
Related papers
50 records in total
  • [21] Binary Quantized Network Training With Sharpness-Aware Minimization
    Liu, Ren
    Bian, Fengmiao
    Zhang, Xiaoqun
    JOURNAL OF SCIENTIFIC COMPUTING, 2023, 94 (01)
  • [22] CR-SAM: Curvature Regularized Sharpness-Aware Minimization
    Wu, Tao
    Luo, Tie
    Wunsch, Donald C., II
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 6144 - 6152
  • [23] Sharpness-Aware Minimization Leads to Low-Rank Features
    Andriushchenko, Maksym
    Bahri, Dara
    Mobahi, Hossein
    Flammarion, Nicolas
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [24] Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach
    Mi, Peng
    Shen, Li
    Ren, Tianhe
    Zhou, Yiyi
    Sun, Xiaoshuai
    Ji, Rongrong
    Tao, Dacheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [25] Why Does Sharpness-Aware Minimization Generalize Better Than SGD?
    Chen, Zixiang
    Zhang, Junkai
    Kou, Yiwen
    Chen, Xiangning
    Hsieh, Cho-Jui
    Gu, Quanquan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [26] TOWARDS BOOSTING BLACK-BOX ATTACK VIA SHARPNESS-AWARE
    Zhang, Yukun
    Yuan, Shengming
    Song, Jingkuan
    Zhou, Yixuan
    Zhang, Lin
    He, Yulan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 294 - 299
  • [27] A Retinal Vessel Segmentation Method Based on the Sharpness-Aware Minimization Model
    Mariam, Iqra
    Xue, Xiaorong
    Gadson, Kaleb
    SENSORS, 2024, 24 (13)
  • [28] Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima
    Si, Dongkuk
    Yun, Chulhee
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [29] Research on the improvement of domain generalization by the fusion of invariant features and sharpness-aware minimization
    Yang, Yixuan
    Dong, Mingrong
    Zeng, Kai
    Shen, Tao
    THE JOURNAL OF SUPERCOMPUTING, 2025, 81 (1)
  • [30] Federated Model-Agnostic Meta-Learning With Sharpness-Aware Minimization for Internet of Things Optimization
    Wu, Qingtao
    Zhang, Yong
    Liu, Muhua
    Zhu, Junlong
    Zheng, Ruijuan
    Zhang, Mingchuan
    IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (19): 31317 - 31330