Differentiable Subset Pruning of Transformer Heads

Cited by: 14
Authors
Li, Jiaoda [1]
Cotterell, Ryan [1,2]
Sachan, Mrinmaya [1]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] Univ Cambridge, Cambridge, England
Keywords
DOI
10.1162/tacl_a_00436
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous methods while offering precise control of the sparsity level.
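As a rough illustration of the mechanism the abstract describes (per-head importance variables trained by SGD, with a hard cap on the number of surviving heads), here is a minimal PyTorch sketch. It assumes a Gumbel-noise soft top-k relaxation as one way to make the subset choice differentiable; the names soft_topk_gates and head_logits are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def soft_topk_gates(logits: torch.Tensor, k: int, tau: float = 0.5) -> torch.Tensor:
    """Relaxed selection of k out of len(logits) items (one assumed realization).

    Gumbel noise is added to the logits, then k successive softmaxes are taken,
    each masking out mass that earlier steps already assigned; the summed soft
    one-hots form gates that are approximately in [0, 1] and sum to k.
    """
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    scores = logits + gumbel
    khot = torch.zeros_like(scores)
    onehot_approx = torch.zeros_like(scores)
    for _ in range(k):
        # Down-weight items that are already (softly) selected, then renormalize.
        scores = scores + torch.log(torch.clamp(1.0 - onehot_approx, min=1e-20))
        onehot_approx = F.softmax(scores / tau, dim=-1)
        khot = khot + onehot_approx
    return khot

# Toy usage: gate each head's output by its soft selection weight so that
# gradients flow back to the per-head importance logits.
num_heads, k = 12, 4
head_logits = torch.zeros(num_heads, requires_grad=True)   # learned importances
gates = soft_topk_gates(head_logits, k)                     # shape: (num_heads,)
head_outputs = torch.randn(2, num_heads, 16, 64)            # (batch, heads, seq, dim)
pruned = head_outputs * gates.view(1, num_heads, 1, 1)      # soft head pruning
pruned.sum().backward()                                     # head_logits.grad is populated
```

At test time, the k heads with the largest learned logits would be kept and the rest removed outright, which is what gives the user exact control over the sparsity level.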
Pages: 1442 - 1459
Page count: 18
Related Papers
50 in total
  • [1] Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation
    Behnke, Maximiliana
    Heafield, Kenneth
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2664 - 2674
  • [2] Differentiable Transportation Pruning
    Li, Yunqiang
    van Gemert, Jan C.
    Hoefler, Torsten
    Moons, Bert
    Eleftheriou, Evangelos
    Verhoef, Bram-Ernst
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 16911 - 16921
  • [3] Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling
    Shim, Kyuhong
    Choi, Iksoo
    Sung, Wonyong
    Choi, Jungwook
    [J]. 18TH INTERNATIONAL SOC DESIGN CONFERENCE 2021 (ISOCC 2021), 2021, : 357 - 358
  • [4] Disentangled Differentiable Network Pruning
    Gao, Shangqian
    Huang, Feihu
    Zhang, Yanfu
    Huang, Heng
    [J]. COMPUTER VISION, ECCV 2022, PT XI, 2022, 13671 : 328 - 345
  • [5] Differentiable Mask for Pruning Convolutional and Recurrent Networks
    Ramakrishnan, Ramchalam Kinattinkara
    Sari, Eyyub
    Nia, Vahid Partovi
    [J]. 2020 17TH CONFERENCE ON COMPUTER AND ROBOT VISION (CRV 2020), 2020, : 222 - 229
  • [6] Shift Pruning: Equivalent Weight Pruning for CNN via Differentiable Shift Operator
    Niu, Tao
    Lou, Yihang
    Teng, Yinglei
    He, Jianzhong
    Liu, Yiding
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5445 - 5454
  • [7] Transformer Memory as a Differentiable Search Index
    Tay, Yi
    Tran, Vinh Q.
    Dehghani, Mostafa
    Ni, Jianmo
    Bahri, Dara
    Mehta, Harsh
    Qin, Zhen
    Hui, Kai
    Zhao, Zhe
    Gupta, Jai
    Schuster, Tal
Cohen, William W.
    Metzler, Donald
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [8] Model Compression Based on Differentiable Network Channel Pruning
    Zheng, Yu-Jie
    Chen, Si-Bao
    Ding, Chris H. Q.
    Luo, Bin
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (12) : 10203 - 10212
  • [9] DMCP: Differentiable Markov Channel Pruning for Neural Networks
    Guo, Shaopeng
    Wang, Yujie
    Li, Quanquan
    Yan, Junjie
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 1536 - 1544
  • [10] Differentiable channel pruning guided via attention mechanism: a novel neural network pruning approach
Cheng, Hanjing
Wang, Zidong
Ma, Lifeng
Wei, Zhihui
Alsaadi, Fawaz E.
Liu, Xiaohui
    [J]. Complex & Intelligent Systems, 2023, 9 : 5611 - 5624