AdaSAM: Boosting sharpness-aware minimization with adaptive learning rate and momentum for neural networks

Cited by: 5
Authors
Sun, Hao [1 ]
Shen, Li [2 ]
Zhong, Qihuang [3 ]
Ding, Liang [2 ]
Chen, Shixiang [4 ]
Sun, Jingwei [1 ]
Li, Jing [1 ]
Sun, Guangzhong [1 ]
Tao, Dacheng [5 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci, Hefei 230026, Anhui, Peoples R China
[2] JD com, Beijing, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan 430072, Hubei, Peoples R China
[4] Univ Sci & Technol China, Sch Math Sci, Hefei 230026, Anhui, Peoples R China
[5] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
Keywords
Sharpness-aware minimization; Adaptive learning rate; Non-convex optimization; Momentum acceleration; Linear speedup; CONVERGENCE;
DOI
10.1016/j.neunet.2023.10.044
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The sharpness-aware minimization (SAM) optimizer has been extensively explored because it improves generalization when training deep neural networks by introducing an extra perturbation step that flattens the loss landscape. Integrating SAM with an adaptive learning rate and momentum acceleration, dubbed AdaSAM, has already been explored empirically for training large-scale deep neural networks, but without theoretical guarantees, owing to the difficulty of jointly analyzing the coupled perturbation step, adaptive learning rate, and momentum step. In this paper, we analyze the convergence rate of AdaSAM in the stochastic non-convex setting. We theoretically show that AdaSAM admits an O(1/√(bT)) convergence rate, which achieves the linear speedup property with respect to the mini-batch size b. Specifically, to decouple the stochastic gradient step from the adaptive learning rate and the perturbed gradient, we introduce a delayed second-order momentum term that makes these terms independent when taking expectations in the analysis. We then bound them by showing that the adaptive learning rate lies in a limited range, which makes our analysis feasible. To the best of our knowledge, we are the first to provide a non-trivial convergence rate for SAM with an adaptive learning rate and momentum acceleration. Finally, we conduct experiments on several NLP tasks and a synthetic task, which show that AdaSAM achieves superior performance compared with the SGD, AMSGrad, and SAM optimizers.
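For intuition only, below is a minimal PyTorch-style sketch of what an AdaSAM-type update looks like: a SAM ascent step to a perturbed point, followed by an Adam-style descent that uses first- and second-moment estimates for momentum and the adaptive learning rate. The class name `AdaSAMSketch`, the two-closure interface, and the omission of bias correction and of the delayed second-order momentum term used in the paper's analysis are assumptions made here for illustration; this is not the authors' reference implementation.

```python
# Illustrative AdaSAM-style optimizer sketch: SAM perturbation + Adam-style
# momentum and adaptive learning rate. Hypothetical code, not the paper's
# reference implementation; bias correction and the delayed second-moment
# trick from the paper's analysis are omitted for brevity.
import torch


class AdaSAMSketch(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, rho=0.05, betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, rho=rho, betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure):
        # The closure must zero gradients, recompute the mini-batch loss,
        # call backward(), and return the loss.
        with torch.enable_grad():
            closure()  # gradients at the current weights w

        # 1) SAM ascent: move to the perturbed point w + rho * g / ||g||.
        for group in self.param_groups:
            grads = [p.grad for p in group["params"] if p.grad is not None]
            grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
            scale = group["rho"] / (grad_norm + 1e-12)
            for p in group["params"]:
                if p.grad is None:
                    continue
                e_w = p.grad * scale
                self.state[p]["e_w"] = e_w
                p.add_(e_w)  # climb to the perturbed point

        with torch.enable_grad():
            loss = closure()  # gradients at the perturbed point

        # 2) Adam-style descent at the original weights, driven by the
        #    gradient taken at the perturbed point.
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                e_w = state.pop("e_w", None)
                if e_w is not None:
                    p.sub_(e_w)  # restore the original weights
                if "m" not in state:
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                state["m"].mul_(beta1).add_(p.grad, alpha=1 - beta1)
                state["v"].mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                p.addcdiv_(state["m"], state["v"].sqrt().add_(group["eps"]),
                           value=-group["lr"])
        return loss
```

The closure passed to `step` is assumed to zero the gradients, recompute the loss on the same mini-batch, call `backward()`, and return the loss, so that gradients are available both at the current weights and at the perturbed weights.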
Pages: 506-519
Page count: 14