Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

Cited by: 1
Authors
Shim, Kyuhong [1]
Choi, Iksoo [1]
Sung, Wonyong [1]
Choi, Jungwook [2]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea
[2] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Funding
National Research Foundation of Singapore;
Keywords
pruning; transformer; multihead attention;
DOI
10.1109/ISOCC53507.2021.9613933
Chinese Library Classification (CLC)
TP3 [Computing technology and computer technology];
Subject classification code
0812;
Abstract
Recently, the necessity of multiple attention heads in the transformer architecture has been questioned [1]. Removing less important heads from a large network is a promising strategy for reducing computation cost and parameters. However, pruning attention heads in multihead attention does not evenly reduce the overall load, because the feedforward modules are unaffected. In this study, we apply attention head pruning to the All-attention [2] transformer, where the computational savings are proportional to the number of pruned heads. This improved computing efficiency comes at the cost of increased pruning sensitivity, which we stabilize with three training techniques. Our attention head pruning achieves a considerably smaller number of parameters with comparable perplexity for transformer-based language modeling.
Pages: 357-358
Number of pages: 2
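
To make the head-pruning idea in the abstract concrete, the sketch below shows a minimal gated multi-head attention layer in PyTorch. This is an illustrative assumption, not the authors' implementation: each head's output is scaled by a learnable gate, and heads whose gates shrink toward zero (e.g., under a sparsity penalty applied during training) can be removed afterwards. The class GatedMultiHeadAttention, the gate parameter head_gates, and the helper prunable_heads with its threshold argument are hypothetical names introduced here for illustration only.

    import torch
    import torch.nn as nn

    class GatedMultiHeadAttention(nn.Module):
        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads = n_heads
            self.d_head = d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.out = nn.Linear(d_model, d_model)
            # One learnable gate per head; a sparsity penalty on these gates
            # during training (not shown) would push unimportant heads toward zero.
            self.head_gates = nn.Parameter(torch.ones(n_heads))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, T, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Split the model dimension into heads: (B, n_heads, T, d_head).
            q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                       for t in (q, k, v))
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
            ctx = attn @ v                                  # (B, n_heads, T, d_head)
            ctx = ctx * self.head_gates.view(1, -1, 1, 1)   # scale each head's output
            ctx = ctx.transpose(1, 2).reshape(B, T, -1)
            return self.out(ctx)

    def prunable_heads(layer: GatedMultiHeadAttention, threshold: float = 0.05):
        # Heads whose gate magnitude fell below the threshold can be dropped,
        # reducing parameters and computation roughly in proportion to their count.
        return (layer.head_gates.abs() < threshold).nonzero(as_tuple=True)[0].tolist()

As a usage example, after training one could call prunable_heads(layer) on each layer to list removable heads. In an All-attention-style block, where the feedforward sublayer is folded into attention, dropping those heads shrinks the whole layer rather than only the attention projections, which is the proportional saving the abstract refers to.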