Gradient Sparsification For Masked Fine-Tuning of Transformers

Cited by: 0
Authors
O'Neill, James [1]
Dutta, Sourav [1]
Affiliations
[1] Huawei Ireland Res Ctr, Dublin, Ireland
Keywords
neural nets; sparse regularization; fine-tuning
DOI
10.1109/IJCNN54540.2023.10191206
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Fine-tuning pretrained self-supervised language models is widely adopted for transfer learning to downstream tasks. Fine-tuning can be achieved by freezing the gradients of the pretrained network and only updating the gradients of a newly added classification layer, or by performing gradient updates on all parameters. Gradual unfreezing trades off between the two by progressively unfreezing the gradients of whole layers during training, and it has been an effective way to balance storage and training speed against generalization performance. However, it is not clear whether gradually unfreezing layers throughout training is optimal compared to sparse variants of gradual unfreezing, which may improve fine-tuning performance. In this paper, we propose to stochastically mask gradients to regularize pretrained language models and improve overall fine-tuned performance. We introduce GradDrop and variants thereof, a class of gradient sparsification methods that mask gradients during the backward pass, acting as gradient noise. Unlike gradual freezing, GradDrop is sparse and stochastic. Extensive experiments on the multilingual XGLUE benchmark with XLMR-Large show that GradDrop is competitive against methods that use additional translated data for intermediate pretraining, and that it outperforms standard fine-tuning and gradual unfreezing. A post-analysis shows how GradDrop improves performance on languages it was not trained on, such as under-resourced languages.
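The abstract describes GradDrop only at a high level: gradients of the pretrained weights are stochastically masked during the backward pass. As a rough illustration, the sketch below shows one way such gradient masking could be wired up in PyTorch. The helper name attach_gradient_dropout, the drop probability p_drop, and the toy encoder/classifier are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of stochastic gradient masking during fine-tuning,
# assuming a fresh per-parameter Bernoulli mask on every backward pass.
# Names and hyperparameters here are illustrative, not from the paper.
import torch
import torch.nn as nn


def attach_gradient_dropout(module: nn.Module, p_drop: float = 0.5) -> None:
    """Register backward hooks that randomly zero gradients of pretrained weights."""
    for param in module.parameters():
        if param.requires_grad:
            # The hook fires on every backward pass; drawing a new Bernoulli
            # mask each time makes the sparsification stochastic rather than
            # a fixed freezing schedule.
            param.register_hook(
                lambda grad, p=p_drop: grad
                * torch.bernoulli(torch.full_like(grad, 1.0 - p))
            )


if __name__ == "__main__":
    # Stand-ins for a pretrained encoder and a newly added classification head.
    encoder = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
    classifier = nn.Linear(16, 3)

    # Mask only the pretrained encoder's gradients; the new head trains densely.
    attach_gradient_dropout(encoder, p_drop=0.5)

    optimizer = torch.optim.AdamW(
        list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3
    )
    x, y = torch.randn(8, 16), torch.randint(0, 3, (8,))
    loss = nn.functional.cross_entropy(classifier(encoder(x)), y)
    loss.backward()   # hooks apply the stochastic gradient masks here
    optimizer.step()
```

Because a new mask is drawn on every backward pass, the subset of pretrained weights that receives updates changes from step to step, which is what distinguishes this kind of sparse, stochastic masking from a fixed gradual-unfreezing schedule.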
Pages: 8