Online Compressive Transformer for End-to-End Speech Recognition

Cited by: 8
Authors
Leong, Chi-Hang [1 ]
Huang, Yu-Han [1 ]
Chien, Jen-Tzung [1 ]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Dept Elect & Comp Engn, Taipei, Taiwan
Source
INTERSPEECH 2021
Keywords
Online processing and learning; compressive transformer; end-to-end speech recognition; self-attention
DOI
10.21437/Interspeech.2021-545
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Traditionally, the transformer with connectionist temporal classification (CTC) was developed for offline speech recognition, where the transcription is generated only after the whole utterance has been spoken. However, many applications, including live broadcasting and meetings, require online transcription of the speech signal. This paper presents an online transformer for real-time speech recognition in which the transcription is generated chunk by chunk. In particular, an online compressive transformer (OCT) is proposed for end-to-end speech recognition. OCT aims to generate an immediate transcription for each audio chunk while still achieving performance comparable to offline speech recognition. In the implementation, OCT is tightly combined with both CTC and the recurrent neural network transducer, and is trained by minimizing their losses. In addition, OCT is systematically merged with a compressive memory to reduce the performance degradation that online processing can cause, which arises because each chunk is transcribed without history information. Experiments on speech recognition show that OCT not only obtains performance comparable to the offline transformer but also runs faster than the baseline model.
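The chunk-by-chunk processing with a compressive memory described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the compression function or the memory sizes, so mean pooling, the class name `CompressiveMemory`, the helper `stream`, and the parameters `mem_size`, `comp_size`, and `rate` are all assumptions made for illustration. The sketch only shows how history could be handed to each new chunk; the transformer, CTC, and transducer components are omitted.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence, Tuple

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def mean_pool(frames: Sequence[Frame], rate: int) -> List[Frame]:
    """Average each group of `rate` consecutive frames (assumed compression)."""
    pooled = []
    for start in range(0, len(frames), rate):
        group = frames[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


class CompressiveMemory:
    """Recent frames are kept verbatim; older frames are kept in compressed form."""

    def __init__(self, mem_size: int, comp_size: int, rate: int) -> None:
        self.mem_size = mem_size
        self.rate = rate
        self.memory: List[Frame] = []
        # The oldest compressed frames are dropped once comp_size is exceeded.
        self.compressed: Deque[Frame] = deque(maxlen=comp_size)

    def context(self) -> List[Frame]:
        """History visible to the next chunk: compressed frames, then recent ones."""
        return list(self.compressed) + self.memory

    def update(self, chunk: Sequence[Frame]) -> None:
        """Append a processed chunk and compress whatever overflows the memory."""
        self.memory.extend(chunk)
        overflow = len(self.memory) - self.mem_size
        if overflow > 0:
            evicted = self.memory[:overflow]
            self.memory = self.memory[overflow:]
            self.compressed.extend(mean_pool(evicted, self.rate))


def stream(frames: Sequence[Frame], chunk_size: int,
           memory: CompressiveMemory) -> Iterator[Tuple[List[Frame], List[Frame]]]:
    """Yield (chunk, history) pairs in arrival order, updating memory after each."""
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        yield chunk, memory.context()
        memory.update(chunk)


if __name__ == "__main__":
    utterance = [[float(i)] for i in range(8)]
    mem = CompressiveMemory(mem_size=2, comp_size=4, rate=2)
    for chunk, history in stream(utterance, chunk_size=2, memory=mem):
        print(chunk, history)
```

In a real system each `(chunk, history)` pair would be passed to the encoder so that attention over the current chunk can also reach the stored history, which is the role the abstract assigns to the compressive memory.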
Pages: 2082-2086
Number of pages: 5
Related papers
50 in total
  • [1] TRANSFORMER-BASED ONLINE CTC/ATTENTION END-TO-END SPEECH RECOGNITION ARCHITECTURE
    Miao, Haoran
    Cheng, Gaofeng
    Gao, Changfeng
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6084 - 6088
  • [2] Semantic Mask for Transformer based End-to-End Speech Recognition
    Wang, Chengyi
    Wu, Yu
    Du, Yujiao
    Li, Jinyu
    Liu, Shujie
    Lu, Liang
    Ren, Shuo
    Ye, Guoli
    Zhao, Sheng
    Zhou, Ming
    [J]. INTERSPEECH 2020, 2020, : 971 - 975
  • [3] END-TO-END MULTI-CHANNEL TRANSFORMER FOR SPEECH RECOGNITION
    Chang, Feng-Ju
    Radfar, Martin
    Mouchtaris, Athanasios
    King, Brian
    Kunzmann, Siegfried
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5884 - 5888
  • [4] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
  • [5] Improved training for online end-to-end speech recognition systems
    Kim, Suyoun
    Seltzer, Michael L.
    Li, Jinyu
    Zhao, Rui
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2913 - 2917
  • [6] Online Continual Learning of End-to-End Speech Recognition Models
    Yang, Muqiao
    Lane, Ian
    Watanabe, Shinji
    [J]. INTERSPEECH 2022, 2022, : 2668 - 2672
  • [7] Transformer Model Compression for End-to-End Speech Recognition on Mobile Devices
    Ben Letaifa, Leila
    Rouas, Jean-Luc
    [J]. 2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 439 - 443
  • [8] Simple Data Augmented Transformer End-To-End Tibetan Speech Recognition
    Yang, Xiaodong
    Wang, Weizhe
    Yang, Hongwu
    Jiang, Jiaolong
    [J]. 2020 IEEE 3RD INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND SIGNAL PROCESSING (ICICSP 2020), 2020, : 148 - 152
  • [9] A Transformer-Based End-to-End Automatic Speech Recognition Algorithm
    Dong, Fang
    Qian, Yiyang
    Wang, Tianlei
    Liu, Peng
    Cao, Jiuwen
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 1592 - 1596
  • [10] Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition
    Wang, Qinyi
    Zhou, Xinyuan
    Li, Haizhou
    [J]. APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2023, 12 (01)