Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Cited by: 0
Authors
Zhao, Chendong [1 ,2 ]
Wang, Jianzong [1 ]
Wei, Wenqi [1 ]
Qu, Xiaoyang [1 ]
Wang, Haoqian [2 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Beijing, Peoples R China
Keywords
Automatic Speech Recognition; Sparse Attention; Monotonic Attention; Self-Attention;
DOI
10.1109/DSAA54385.2022.10032360
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the attention mechanism based on softmax normalization makes it impossible to highlight important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments across different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme that lets each self-attention structure better fit its corresponding head. The monotonic attention applies regularization to prune redundant heads in the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
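The sparse mechanism described above replaces softmax normalization in each self-attention head with a learned sparsity scheme; since the abstract does not spell out that scheme, the minimal sketch below only illustrates the general idea, substituting sparsemax (Martins & Astudillo, 2016) as a representative sparse alternative to softmax inside scaled dot-product attention. The PyTorch framing and the names sparsemax and sparse_dot_product_attention are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper proposes a *learned*, per-head sparsity
# scheme; sparsemax is used here merely as a stand-in sparse normalizer.
import torch


def sparsemax(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Project logits onto the probability simplex, producing exact zeros."""
    z_sorted, _ = torch.sort(logits, dim=dim, descending=True)
    z_cumsum = z_sorted.cumsum(dim)
    k = torch.arange(1, logits.size(dim) + 1,
                     device=logits.device, dtype=logits.dtype)
    view = [1] * logits.dim()
    view[dim] = -1
    k = k.view(view)
    # Support size: largest k with 1 + k * z_(k) > sum_{j<=k} z_(j).
    support = (1.0 + k * z_sorted) > z_cumsum
    k_z = support.sum(dim=dim, keepdim=True).clamp(min=1)
    tau = (z_cumsum.gather(dim, k_z - 1) - 1.0) / k_z.to(logits.dtype)
    return torch.clamp(logits - tau, min=0.0)


def sparse_dot_product_attention(q, k, v):
    """Scaled dot-product attention with sparsemax in place of softmax."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = sparsemax(scores, dim=-1)   # many weights are exactly zero
    return weights @ v, weights


if __name__ == "__main__":
    q = torch.randn(2, 4, 10, 64)   # (batch, heads, frames, head_dim)
    k = torch.randn(2, 4, 10, 64)
    v = torch.randn(2, 4, 10, 64)
    out, w = sparse_dot_product_attention(q, k, v)
    print(out.shape, (w == 0).float().mean().item())  # fraction of zeroed weights
```

Unlike softmax, a sparse normalizer of this kind can assign exactly zero weight to irrelevant frames, so the remaining attention mass concentrates on the informative parts of the utterance, which is the behaviour the abstract attributes to the sparse mechanism.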
Pages: 173-180
Page count: 8