Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Cited by: 0
Authors
Zhao, Chendong [1 ,2 ]
Wang, Jianzong [1 ]
Wei, Wenqi [1 ]
Qu, Xiaoyang [1 ]
Wang, Haoqian [2 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
Keywords
Automatic Speech Recognition; Sparse Attention; Monotonic Attention; Self-Attention;
DOI
10.1109/DSAA54385.2022.10032360
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-normalized attention mechanism assigns nonzero weight to every position and therefore cannot sharply highlight important speech information. For multi-head attention in Transformer ASR, it is difficult to model monotonic alignments across different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme that enables each self-attention structure to better fit its corresponding head. The monotonic attention deploys regularization to prune redundant heads from the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
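The abstract does not detail the learned sparsity scheme. As an illustrative sketch only (not the paper's actual method), the snippet below implements sparsemax, one well-known sparse alternative to softmax: it projects attention scores onto the probability simplex and can assign exactly zero weight to low-scoring positions, which is the property the abstract contrasts with softmax. The function name and example scores are hypothetical.

```python
import numpy as np

def sparsemax(scores):
    """Project attention scores onto the probability simplex.

    Unlike softmax, which gives every position a strictly positive
    weight, sparsemax can assign exactly zero to low-scoring
    positions, letting attention concentrate on important frames.
    """
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]              # sort scores descending
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # positions kept in the support
    k_z = k[support][-1]                     # size of the support
    tau = (cumsum[support][-1] - 1.0) / k_z  # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

# One strong score dominates: sparsemax zeroes out the rest,
# whereas softmax would still spread mass over every position.
print(sparsemax([3.0, 1.0, 0.1]))  # -> [1. 0. 0.]
```

With uniform scores sparsemax reduces to the uniform distribution, so it only sparsifies when some positions genuinely stand out.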
Pages: 173 - 180
Page count: 8
Related Papers
50 records in total
  • [21] Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention
    Liang, Chengdong
    Xu, Menglong
    Zhang, Xiao-Lei
    INTERSPEECH 2021, 2021, 2 : 1495 - 1499
  • [22] UNTIED POSITIONAL ENCODINGS FOR EFFICIENT TRANSFORMER-BASED SPEECH RECOGNITION
    Samarakoon, Lahiru
    Fung, Ivan
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 108 - 114
  • [23] The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition
    Zhang, Enshi
    Trujillo, Rafael
    Poellabauer, Christian
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 13960 - 13970
  • [24] AFT-SAM: Adaptive Fusion Transformer with a Sparse Attention Mechanism for Audio-Visual Speech Recognition
    Che, Na
    Zhu, Yiming
    Wang, Haiyan
    Zeng, Xianwei
    Du, Qinsheng
    APPLIED SCIENCES-BASEL, 2025, 15 (01):
  • [25] Intra-ensemble: A New Method for Combining Intermediate Outputs in Transformer-based Automatic Speech Recognition
    Kim, DoHee
    Choi, Jieun
    Chang, Joon-Hyuk
    INTERSPEECH 2023, 2023, : 2203 - 2207
  • [26] STRUCTURED SPARSE ATTENTION FOR END-TO-END AUTOMATIC SPEECH RECOGNITION
    Xue, Jiabin
    Zheng, Tieran
    Han, Jiqing
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7044 - 7048
  • [27] Layer Sparse Transformer for Speech Recognition
    Wang, Peng
    Guo, Zhiyuan
    Xie, Fei
    2023 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH, ICKG, 2023, : 269 - 273
  • [28] Simulating reading mistakes for child speech Transformer-based phone recognition
    Gelin, Lucile
    Pellegrini, Thomas
    Pinquier, Julien
    Daniel, Morgane
    INTERSPEECH 2021, 2021, : 3860 - 3864
  • [29] End to end transformer-based contextual speech recognition based on pointer network
    Lin, Binghuai
    Wang, Liyuan
    INTERSPEECH 2021, 2021, : 2087 - 2091
  • [30] Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition
    Lohrenz, Timo
    Li, Zhengyang
    Fingscheidt, Tim
    INTERSPEECH 2021, 2021, : 2846 - 2850