Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

Cited by: 1
Authors
Shim, Kyuhong [1 ]
Choi, Iksoo [1 ]
Sung, Wonyong [1 ]
Choi, Jungwook [2 ]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea
[2] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Funding
National Research Foundation of Singapore
Keywords
pruning; transformer; multihead attention
DOI
10.1109/ISOCC53507.2021.9613933
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline code
0812
Abstract
Recently, the necessity of multiple attention heads in the transformer architecture has been questioned [1]. Removing less important heads from a large network is a promising strategy for reducing computation cost and parameter count. However, pruning attention heads in standard multihead attention does not reduce the overall load proportionally, because the feedforward modules are unaffected. In this study, we apply attention head pruning to the All-attention [2] transformer, in which the computation savings are proportional to the number of pruned heads. This improved computing efficiency comes at the cost of increased pruning sensitivity, which we stabilize with three training techniques. Our attention head pruning achieves a considerably smaller number of parameters with comparable perplexity for transformer-based language modeling.
Pages: 357-358
Page count: 2
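
As context for the abstract above: in a standard transformer, removing an attention head leaves the feedforward sublayer untouched, which limits the achievable savings; in the All-attention design the feedforward block is folded into the attention layer, so removing a head removes a proportional share of the layer's compute. The following is a minimal, hypothetical PyTorch sketch of per-head gating for head pruning; the names (GatedMultiheadSelfAttention, head_gate) are illustrative and not from the paper, and the sketch shows only the gating idea, not the authors' three stabilization techniques.

# Minimal sketch (not the authors' code): per-head gating in a multihead
# self-attention layer; a head whose gate is driven to zero can be pruned.
import torch
import torch.nn as nn

class GatedMultiheadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One scalar gate per head (illustrative); zeroed gates mark heads
        # whose parameters can be removed after training.
        self.head_gate = nn.Parameter(torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = att @ v
        # Scale each head's context by its gate; a zero gate silences the head.
        ctx = ctx * self.head_gate.view(1, self.n_heads, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, t, d)
        return self.out(ctx)

# Example usage: y = GatedMultiheadSelfAttention(512, 8)(torch.randn(2, 16, 512))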