Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

Citations: 0
Authors
Li, Zehan [1 ,2 ]
Miao, Haoran [1 ,2 ]
Deng, Keqi [1 ,2 ]
Cheng, Gaofeng [1 ]
Tian, Sanli [1 ,2 ]
Li, Ta [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
Keywords
Streaming ASR; Causal model; Transformer; Encoder states revision; Speech recognition
DOI
10.21437/Interspeech.2022-707
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
There is often a trade-off between performance and latency in streaming automatic speech recognition (ASR). Traditional methods, such as look-ahead and chunk-based methods, usually require information from future frames to improve recognition accuracy, which incurs unavoidable latency even when the computation is fast enough. A causal model that computes without any future frames avoids this latency, but its performance is significantly worse than that of traditional methods. In this paper, we propose revision strategies to improve the causal model. First, we introduce a real-time encoder states revision strategy that modifies previous states: encoder forward computation starts as soon as the data is received and revises the previous encoder states after several frames, with no need to wait for any right context. Furthermore, a CTC spike position alignment decoding algorithm is designed to reduce the time cost introduced by the proposed revision strategy. All experiments are conducted on the LibriSpeech datasets. By fine-tuning a CTC-based wav2vec 2.0 model, our best method achieves WERs of 3.7/9.2 on the test-clean/test-other sets and brings a 45% relative improvement for causal models, which is also competitive with chunk-based methods and knowledge distillation methods.
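As a rough illustration of what a "CTC spike position" is, the sketch below collapses per-frame greedy CTC labels into (label, frame) pairs and then splices a revised hypothesis onto an earlier one by frame position. It is a generic sketch and not the algorithm from the paper: the blank index of 0, the function names `ctc_greedy_spikes` and `merge_revised`, and the splice rule are all assumptions made for this example.

```python
from typing import List, Tuple

BLANK = 0  # assumed blank index; real models may use a different one

Spike = Tuple[int, int]  # (label id, frame index where the label first fires)


def ctc_greedy_spikes(frame_labels: List[int], blank: int = BLANK) -> List[Spike]:
    """Collapse per-frame argmax labels into (label, spike_frame) pairs.

    Standard CTC collapsing: drop blanks and merge consecutive repeats.
    The frame where each surviving label first appears is its spike.
    """
    spikes: List[Spike] = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            spikes.append((label, t))
        prev = label
    return spikes


def merge_revised(old: List[Spike], new: List[Spike], revise_from: int) -> List[Spike]:
    """Hypothetical splice: keep old spikes before `revise_from`,
    take spikes at or after that frame from the revised hypothesis."""
    kept = [s for s in old if s[1] < revise_from]
    updated = [s for s in new if s[1] >= revise_from]
    return kept + updated


if __name__ == "__main__":
    first_pass = ctc_greedy_spikes([0, 3, 3, 0, 3, 5, 5, 0])
    revised = ctc_greedy_spikes([0, 3, 3, 0, 4, 0, 5, 5])
    print(first_pass)
    print(merge_revised(first_pass, revised, revise_from=4))
```

Aligning hypotheses by spike frame rather than by token index means only the tokens whose spikes fall in the revised region need to be reconsidered, which is one plausible way such an alignment could limit repeated decoding work.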
Pages: 1671-1675
Number of pages: 5