Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Cited by: 0
Authors
Zhao, Chendong [1 ,2 ]
Wang, Jianzong [1 ]
Wei, Wenqi [1 ]
Qu, Xiaoyang [1 ]
Wang, Haoqian [2 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
Keywords
Automatic Speech Recognition; Sparse Attention; Monotonic Attention; Self-Attention;
DOI
10.1109/DSAA54385.2022.10032360
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax normalization underlying the attention mechanism makes it impossible to highlight important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments in different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme that lets each self-attention structure better fit its corresponding head. The monotonic attention deploys regularization to prune redundant heads from the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
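The abstract contrasts softmax attention, which assigns non-zero weight to every position, with a sparse attention scheme that can zero out unimportant positions. The paper's learned sparsity scheme is not detailed in this record, so the sketch below illustrates the general idea with sparsemax (Martins & Astudillo, 2016), a standard sparse replacement for softmax; it is an assumption-labeled illustration, not the authors' exact method.

```python
import numpy as np

def softmax(z):
    """Standard softmax: every attention weight is strictly positive."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax: Euclidean projection of the score vector onto the
    probability simplex. Low-scoring positions receive exactly zero
    weight, letting attention concentrate on the important frames."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum  # positions kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z  # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

scores = np.array([3.0, 1.0, 0.2, -1.0])
print(softmax(scores))    # dense: all four entries > 0
print(sparsemax(scores))  # sparse: the three low-scoring entries are exactly 0
```

Both outputs are valid probability distributions (non-negative, summing to 1); the difference is that sparsemax produces exact zeros, which is the property the abstract relies on for highlighting important speech information.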
Pages: 173 - 180
Page count: 8
Related Papers
50 records in total
  • [31] Improving Transformer-based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation
    Li, Sheng
    Dabre, Raj
    Lu, Xugang
    Shen, Peng
    Kawahara, Tatsuya
    Kawai, Hisashi
    INTERSPEECH 2019, 2019, : 4400 - 4404
  • [32] Transformer-Based Automatic Speech Recognition with Auxiliary Input of Source Language Text Toward Transcribing Simultaneous Interpretation
    Taniguchi, Shuta
    Kato, Tsuneo
    Tamura, Akihiro
    Yasuda, Keiji
    INTERSPEECH 2022, 2022, : 2813 - 2817
  • [33] Fast offline transformer-based end-to-end automatic speech recognition for real-world applications
    Oh, Yoo Rhee
    Park, Kiyoung
    Park, Jeon Gue
    ETRI JOURNAL, 2022, 44 (03) : 476 - 490
  • [34] A novel transformer-based network with attention mechanism for automatic pavement crack detection
    Guo, Feng
    Liu, Jian
    Lv, Chengshun
    Yu, Huayang
    CONSTRUCTION AND BUILDING MATERIALS, 2023, 391
  • [35] Weak-Attention Suppression For Transformer Based Speech Recognition
    Shi, Yangyang
    Wang, Yongqiang
    Wu, Chunyang
    Fuegen, Christian
    Zhang, Frank
    Le, Duc
    Yeh, Ching-Feng
    Seltzer, Michael L.
    INTERSPEECH 2020, 2020, : 4996 - 5000
  • [36] Sparse Transformer-based bins and Polarized Cross Attention decoder for monocular depth estimation
    Wang, Hai-Kun
    Du, Jiahui
    Song, Ke
    Cui, Limin
    ENGINEERING SCIENCE AND TECHNOLOGY-AN INTERNATIONAL JOURNAL-JESTECH, 2024, 54
  • [37] Transfer Learning of Transformer-Based Speech Recognition Models from Czech to Slovak
    Lehecka, Jan
    Psutka, Josef V.
    Psutka, Josef
    TEXT, SPEECH, AND DIALOGUE, TSD 2023, 2023, 14102 : 328 - 338
  • [38] TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation
    Wang, Ruotong
    Shen, Yanqing
    Zuo, Weiliang
    Zhou, Sanping
    Zheng, Nanning
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 13638 - 13647
  • [39] Transformer-based Long-context End-to-end Speech Recognition
    Hori, Takaaki
    Moritz, Niko
    Hori, Chiori
    Le Roux, Jonathan
    INTERSPEECH 2020, 2020, : 5011 - 5015
  • [40] On-device Streaming Transformer-based End-to-End Speech Recognition
    Oh, Yoo Rhee
    Park, Kiyoung
    INTERSPEECH 2021, 2021, : 967 - 968