Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Cited by: 0
Authors
Zhao, Chendong [1 ,2 ]
Wang, Jianzong [1 ]
Wei, Wenqi [1 ]
Qu, Xiaoyang [1 ]
Wang, Haoqian [2 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
Keywords
Automatic Speech Recognition; Sparse Attention; Monotonic Attention; Self-Attention;
DOI
10.1109/DSAA54385.2022.10032360
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-normalized attention mechanism assigns nonzero weight to every position and therefore cannot sharply highlight important speech information. For multi-head attention in Transformer ASR, it is difficult to model monotonic alignments across different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme that enables each self-attention structure to better fit its corresponding head. The monotonic attention deploys regularization to prune redundant heads from the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
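The abstract does not detail the learned sparsity scheme. As an illustrative sketch only (not the paper's actual method), the snippet below implements sparsemax, one well-known sparse alternative to softmax: it projects attention scores onto the probability simplex and can assign exactly zero weight to low-scoring positions, which is the property the abstract contrasts with softmax. The function name and example scores are hypothetical.

```python
import numpy as np

def sparsemax(scores):
    """Project attention scores onto the probability simplex.

    Unlike softmax, which gives every position a strictly positive
    weight, sparsemax can assign exactly zero to low-scoring
    positions, letting attention concentrate on important frames.
    """
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]              # sort scores descending
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # positions kept in the support
    k_z = k[support][-1]                     # size of the support
    tau = (cumsum[support][-1] - 1.0) / k_z  # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

# One strong score dominates: sparsemax zeroes out the rest,
# whereas softmax would still spread mass over every position.
print(sparsemax([3.0, 1.0, 0.1]))  # -> [1. 0. 0.]
```

With uniform scores sparsemax reduces to the uniform distribution, so it only sparsifies when some positions genuinely stand out.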
Pages: 173 - 180
Page count: 8
Related Papers
50 records in total
  • [21] Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention
    Liang, Chengdong
    Xu, Menglong
    Zhang, Xiao-Lei
    INTERSPEECH 2021, 2021, 2 : 1495 - 1499
  • [22] UNTIED POSITIONAL ENCODINGS FOR EFFICIENT TRANSFORMER-BASED SPEECH RECOGNITION
    Samarakoon, Lahiru
    Fung, Ivan
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 108 - 114
  • [23] The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition
    Zhang, Enshi
    Trujillo, Rafael
    Poellabauer, Christian
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 13960 - 13970
  • [24] AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio-Visual Speech Recognition
    Che, Na
    Zhu, Yiming
    Wang, Haiyan
    Zeng, Xianwei
    Du, Qinsheng
    APPLIED SCIENCES-BASEL, 2025, 15 (01):
  • [25] Intra-ensemble: A New Method for Combining Intermediate Outputs in Transformer-based Automatic Speech Recognition
    Kim, DoHee
    Choi, Jieun
    Chang, Joon-Hyuk
    INTERSPEECH 2023, 2023, : 2203 - 2207
  • [26] STRUCTURED SPARSE ATTENTION FOR END-TO-END AUTOMATIC SPEECH RECOGNITION
    Xue, Jiabin
    Zheng, Tieran
    Han, Jiqing
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7044 - 7048
  • [27] Layer Sparse Transformer for Speech Recognition
    Wang, Peng
    Guo, Zhiyuan
    Xie, Fei
    2023 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH, ICKG, 2023, : 269 - 273
  • [28] Simulating reading mistakes for child speech Transformer-based phone recognition
    Gelin, Lucile
    Pellegrini, Thomas
    Pinquier, Julien
    Daniel, Morgane
    INTERSPEECH 2021, 2021, : 3860 - 3864
  • [29] End to end transformer-based contextual speech recognition based on pointer network
    Lin, Binghuai
    Wang, Liyuan
    INTERSPEECH 2021, 2021, : 2087 - 2091
  • [30] Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition
    Lohrenz, Timo
    Li, Zhengyang
    Fingscheidt, Tim
    INTERSPEECH 2021, 2021, : 2846 - 2850