Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

Times cited: 0
Authors
Li, Zehan [1 ,2 ]
Miao, Haoran [1 ,2 ]
Deng, Keqi [1 ,2 ]
Cheng, Gaofeng [1 ]
Tian, Sanli [1 ,2 ]
Li, Ta [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Source: INTERSPEECH 2022
Keywords
Streaming ASR; Causal model; Transformer; Encoder states revision; SPEECH RECOGNITION;
DOI
10.21437/Interspeech.2022-707
Chinese Library Classification (CLC): O42 [Acoustics]
Subject classification codes: 070206; 082403
Abstract
There is often a trade-off between performance and latency in streaming automatic speech recognition (ASR). Traditional approaches such as look-ahead and chunk-based methods usually require information from future frames to improve recognition accuracy, which incurs unavoidable latency even when computation is fast enough. A causal model that computes without any future frames avoids this latency, but its performance is significantly worse than that of these methods. In this paper, we propose revision strategies to improve the causal model. First, we introduce a real-time encoder states revision strategy that modifies previously computed states: encoder forward computation starts as soon as data are received, and earlier encoder states are revised after several additional frames arrive, so there is no need to wait for any right context. Furthermore, a CTC spike position alignment decoding algorithm is designed to reduce the time cost introduced by the proposed revision strategy. All experiments are conducted on the LibriSpeech datasets. Fine-tuning the CTC-based wav2vec2.0 model, our best method achieves 3.7/9.2 WER on the test-clean/test-other sets, a 45% relative improvement for causal models, and is competitive with chunk-based methods and knowledge distillation methods.
Pages: 1671-1675
Number of pages: 5
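
The abstract describes the encoder states revision idea only at a high level. As a rough, hypothetical sketch (not the authors' implementation), the following PyTorch toy shows one way a causal encoder could emit a state for each frame with zero look-ahead and then revise earlier states once a few more frames have arrived. TinyCausalEncoder, stream_with_revision, revise_every, and lookahead are illustrative names and parameters that do not appear in the paper; a real system would cache encoder states and restrict recomputation (for example, guided by CTC spike positions, which is what the proposed alignment decoding addresses) instead of naively re-running the whole prefix as done here.

    import torch
    import torch.nn as nn


    class TinyCausalEncoder(nn.Module):
        """Toy single-layer self-attention encoder; stands in for the real streaming encoder."""

        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
            )

        def forward(self, x, attn_mask):
            h, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
            return self.ffn(h + x)


    def context_mask(length, lookahead):
        # Boolean mask: True = blocked. Position i may attend to j only if j <= i + lookahead.
        idx = torch.arange(length)
        return idx[None, :] > (idx[:, None] + lookahead)


    @torch.no_grad()
    def stream_with_revision(encoder, frames, revise_every=4, lookahead=4):
        """Emit a state per frame with zero look-ahead, then periodically revise
        older states once `lookahead` future frames have arrived."""
        num_frames, _ = frames.shape
        states = frames.clone()      # running (possibly revised) encoder states
        emitted = []                 # what a decoder would have seen in real time
        for t in range(num_frames):
            x = frames[: t + 1].unsqueeze(0)
            # 1) Strictly causal pass: frame t sees no future frames, so its
            #    state can be used for decoding immediately (no look-ahead latency).
            h = encoder(x, context_mask(t + 1, lookahead=0))
            states[t] = h[0, t]
            emitted.append(states[t].clone())
            # 2) Revision pass: frames older than `lookahead` now have enough
            #    right context, so recompute (revise) their states for later use.
            if (t + 1) % revise_every == 0 and t + 1 > lookahead:
                h_rev = encoder(x, context_mask(t + 1, lookahead=lookahead))
                states[: t + 1 - lookahead] = h_rev[0, : t + 1 - lookahead]
        return torch.stack(emitted), states


    if __name__ == "__main__":
        torch.manual_seed(0)
        enc = TinyCausalEncoder().eval()
        feats = torch.randn(20, 64)              # 20 incoming feature frames
        live, revised = stream_with_revision(enc, feats)
        print(live.shape, revised.shape)         # torch.Size([20, 64]) twice

The property the sketch tries to illustrate is the one claimed in the abstract: outputs are produced from strictly causal states immediately, so revision only improves the history that later frames (and the decoder) rely on, without adding any algorithmic look-ahead latency.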