Differentiable Subset Pruning of Transformer Heads

Cited by: 14
Authors
Li, Jiaoda [1]
Cotterell, Ryan [1,2]
Sachan, Mrinmaya [1]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] Univ Cambridge, Cambridge, England
Keywords
DOI
10.1162/tacl_a_00436
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous methods while offering precise control of the sparsity level.
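As a rough illustration of the mechanism the abstract describes (per-head importance variables trained by SGD, with a hard cap on the number of surviving heads), here is a minimal PyTorch sketch. It assumes a Gumbel-noise soft top-k relaxation as one way to make the subset choice differentiable; the names soft_topk_gates and head_logits are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def soft_topk_gates(logits: torch.Tensor, k: int, tau: float = 0.5) -> torch.Tensor:
    """Relaxed selection of k out of len(logits) items (one assumed realization).

    Gumbel noise is added to the logits, then k successive softmaxes are taken,
    each masking out mass that earlier steps already assigned; the summed soft
    one-hots form gates that are approximately in [0, 1] and sum to k.
    """
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    scores = logits + gumbel
    khot = torch.zeros_like(scores)
    onehot_approx = torch.zeros_like(scores)
    for _ in range(k):
        # Down-weight items that are already (softly) selected, then renormalize.
        scores = scores + torch.log(torch.clamp(1.0 - onehot_approx, min=1e-20))
        onehot_approx = F.softmax(scores / tau, dim=-1)
        khot = khot + onehot_approx
    return khot

# Toy usage: gate each head's output by its soft selection weight so that
# gradients flow back to the per-head importance logits.
num_heads, k = 12, 4
head_logits = torch.zeros(num_heads, requires_grad=True)   # learned importances
gates = soft_topk_gates(head_logits, k)                     # shape: (num_heads,)
head_outputs = torch.randn(2, num_heads, 16, 64)            # (batch, heads, seq, dim)
pruned = head_outputs * gates.view(1, num_heads, 1, 1)      # soft head pruning
pruned.sum().backward()                                     # head_logits.grad is populated
```

At test time, the k heads with the largest learned logits would be kept and the rest removed outright, which is what gives the user exact control over the sparsity level.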
Pages: 1442 - 1459
Page count: 18
Related Papers
50 in total
  • [1] Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation
    Behnke, Maximiliana
    Heafield, Kenneth
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2664 - 2674
  • [2] Differentiable Transportation Pruning
    Li, Yunqiang
    van Gemert, Jan C.
    Hoefler, Torsten
    Moons, Bert
    Eleftheriou, Evangelos
    Verhoef, Bram-Ernst
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 16911 - 16921
  • [3] Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling
    Shim, Kyuhong
    Choi, Iksoo
    Sung, Wonyong
    Choi, Jungwook
    [J]. 18TH INTERNATIONAL SOC DESIGN CONFERENCE 2021 (ISOCC 2021), 2021, : 357 - 358
  • [4] Disentangled Differentiable Network Pruning
    Gao, Shangqian
    Huang, Feihu
    Zhang, Yanfu
    Huang, Heng
    [J]. COMPUTER VISION, ECCV 2022, PT XI, 2022, 13671 : 328 - 345
  • [5] Differentiable Mask for Pruning Convolutional and Recurrent Networks
    Ramakrishnan, Ramchalam Kinattinkara
    Sari, Eyyub
    Nia, Vahid Partovi
    [J]. 2020 17TH CONFERENCE ON COMPUTER AND ROBOT VISION (CRV 2020), 2020, : 222 - 229
  • [6] Shift Pruning: Equivalent Weight Pruning for CNN via Differentiable Shift Operator
    Niu, Tao
    Lou, Yihang
    Teng, Yinglei
    He, Jianzhong
    Liu, Yiding
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5445 - 5454
  • [7] Transformer Memory as a Differentiable Search Index
    Tay, Yi
    Tran, Vinh Q.
    Dehghani, Mostafa
    Ni, Jianmo
    Bahri, Dara
    Mehta, Harsh
    Qin, Zhen
    Hui, Kai
    Zhao, Zhe
    Gupta, Jai
    Schuster, Tal
Cohen, William W.
    Metzler, Donald
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [8] Model Compression Based on Differentiable Network Channel Pruning
    Zheng, Yu-Jie
    Chen, Si-Bao
    Ding, Chris H. Q.
    Luo, Bin
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (12) : 10203 - 10212
  • [9] DMCP: Differentiable Markov Channel Pruning for Neural Networks
    Guo, Shaopeng
    Wang, Yujie
    Li, Quanquan
    Yan, Junjie
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 1536 - 1544
  • [10] Differentiable channel pruning guided via attention mechanism: a novel neural network pruning approach
Cheng, Hanjing
Wang, Zidong
Ma, Lifeng
Wei, Zhihui
Alsaadi, Fawaz E.
Liu, Xiaohui
    [J]. Complex & Intelligent Systems, 2023, 9 : 5611 - 5624