Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

Cited by: 1
Authors
Shim, Kyuhong [1 ]
Choi, Iksoo [1 ]
Sung, Wonyong [1 ]
Choi, Jungwook [2 ]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea
[2] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Funding
National Research Foundation of Singapore
Keywords
pruning; transformer; multihead attention
DOI
10.1109/ISOCC53507.2021.9613933
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline code
0812
Abstract
Recently, the necessity of multiple attention heads in the transformer architecture has been questioned [1]. Removing less important heads from a large network is a promising strategy for reducing computation cost and parameter count. However, pruning attention heads in standard multihead attention does not reduce the overall load proportionally, because the feedforward modules are unaffected. In this study, we apply attention head pruning to the All-attention [2] transformer, in which the computation savings are proportional to the number of pruned heads. This improved computing efficiency comes at the cost of increased pruning sensitivity, which we stabilize with three training techniques. Our attention head pruning achieves a considerably smaller number of parameters with comparable perplexity for transformer-based language modeling.
Pages: 357-358
Page count: 2
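
As context for the abstract above: in a standard transformer, removing an attention head leaves the feedforward sublayer untouched, which limits the achievable savings; in the All-attention design the feedforward block is folded into the attention layer, so removing a head removes a proportional share of the layer's compute. The following is a minimal, hypothetical PyTorch sketch of per-head gating for head pruning; the names (GatedMultiheadSelfAttention, head_gate) are illustrative and not from the paper, and the sketch shows only the gating idea, not the authors' three stabilization techniques.

# Minimal sketch (not the authors' code): per-head gating in a multihead
# self-attention layer; a head whose gate is driven to zero can be pruned.
import torch
import torch.nn as nn

class GatedMultiheadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One scalar gate per head (illustrative); zeroed gates mark heads
        # whose parameters can be removed after training.
        self.head_gate = nn.Parameter(torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = att @ v
        # Scale each head's context by its gate; a zero gate silences the head.
        ctx = ctx * self.head_gate.view(1, self.n_heads, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, t, d)
        return self.out(ctx)

# Example usage: y = GatedMultiheadSelfAttention(512, 8)(torch.randn(2, 16, 512))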