Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

Cited by: 0
Authors
Behnke, Maximiliana [1]
Heafield, Kenneth [1]
Affiliation
[1] Univ Edinburgh, Edinburgh, Midlothian, Scotland
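Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes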
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The attention mechanism is the crucial component of the transformer architecture. Recent research shows that most attention heads are not confident in their decisions and can be pruned after training. However, removing them before training a model results in lower quality. In this paper, we apply the lottery ticket hypothesis to prune heads in the early stages of training, instead of doing so on a fully converged model. Our experiments on machine translation show that it is possible to remove up to three-quarters of all attention heads from a transformer-big model with an average change of -0.1 BLEU for Turkish -> English. The pruned model is 1.5 times as fast at inference, albeit at the cost of longer training. The method is complementary to other approaches, such as teacher-student, with our English -> German student losing 0.2 BLEU at 75% encoder attention sparsity.
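As a rough illustration of what pruning heads by confidence can look like, the sketch below scores each head by the mean of its maximum attention weight and masks the lowest-scoring 75%. This is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not define the confidence score or the pruning schedule, and the helper names (`head_confidence`, `prune_mask`), the 16-head layer, and the random attention weights are all hypothetical.

```python
import numpy as np

def head_confidence(attn):
    """Assumed score: mean over queries of the largest attention weight.

    attn has shape (heads, queries, keys); each row over keys sums to 1.
    """
    return attn.max(axis=-1).mean(axis=-1)

def prune_mask(confidence, sparsity):
    """Boolean keep-mask that drops the `sparsity` fraction of least confident heads."""
    n_heads = confidence.shape[0]
    n_prune = int(round(sparsity * n_heads))
    order = np.argsort(confidence, kind="stable")  # least confident first
    keep = np.ones(n_heads, dtype=bool)
    keep[order[:n_prune]] = False
    return keep

# Hypothetical layer: 16 heads attending over 10 positions, random weights.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 10, 10))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

confidence = head_confidence(attn)
keep = prune_mask(confidence, sparsity=0.75)
print(int(keep.sum()))  # 4 of the 16 heads remain
```

In a lottery-ticket style setup, a mask like `keep` would be derived from a partially trained model and then applied when training is restarted; how and when that happens here is described in the paper itself, not in this record.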
Pages: 2664-2674
Page count: 11