JOINT CTC-ATTENTION BASED END-TO-END SPEECH RECOGNITION USING MULTI-TASK LEARNING

Cited: 0
Authors
Kim, Suyoun [1 ,2 ]
Hori, Takaaki [1 ]
Watanabe, Shinji [1 ]
Affiliations
[1] Mitsubishi Elect Res Labs, Cambridge, MA 02139 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
end-to-end; speech recognition; connectionist temporal classification; attention; multi-task learning;
DOI
Not available
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any pre-defined alignments. One approach is the attention-based encoder-decoder framework, which learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve performance over another end-to-end approach, Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target characters without any conditional independence assumptions. However, we observed that the attention model performs poorly in noisy conditions and is hard to train in the initial stage when input sequences are long, because it is too flexible to predict proper alignments in such cases, lacking the left-to-right constraints used in CTC. This paper presents a novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. Experiments on the WSJ and CHiME-4 tasks demonstrate its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4-14.6% relative improvements in Character Error Rate (CER).
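The joint model described in the abstract is trained with a weighted sum of the two objectives computed over a shared encoder. A minimal sketch of such a multi-task objective, with \lambda used here as an assumed symbol for the interpolation weight between the CTC loss and the attention (cross-entropy) loss:

\mathcal{L}_{\mathrm{MTL}} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{Attention}}, \qquad \lambda \in [0, 1]

Setting \lambda = 0 recovers a pure attention-based model and \lambda = 1 a pure CTC model; intermediate values let the monotonic, left-to-right CTC alignment constrain the attention decoder, which is what the abstract credits for the improved robustness and faster convergence.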
Pages: 4835 - 4839
Number of pages: 5
Related papers
(50 in total)
  • [1] STREAMING END-TO-END SPEECH RECOGNITION WITH JOINT CTC-ATTENTION BASED MODELS
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 936 - 943
  • [2] Investigating Joint CTC-Attention Models for End-to-End Russian Speech Recognition
    Markovnikov, Nikita
    Kipyatkova, Irina
    [J]. SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 337 - 347
  • [3] Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units
    Xiao, Zhangyu
    Ou, Zhijian
    Chu, Wei
    Lin, Hui
    [J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 146 - 150
  • [4] Joint CTC-Attention End-to-End Speech Recognition with a Triangle Recurrent Neural Network Encoder
    Zhu T.
    Cheng C.
    [J]. Journal of Shanghai Jiaotong University (Science), 2020, 25 (01) : 70 - 75
  • [5] Improved CTC-Attention Based End-to-End Speech Recognition on Air Traffic Control
    Zhou, Kai
    Yang, Qun
    Sun, XiuSong
    Liu, ShaoHan
    Lu, JinJun
    [J]. INTELLIGENCE SCIENCE AND BIG DATA ENGINEERING: BIG DATA AND MACHINE LEARNING, PT II, 2019, 11936 : 187 - 196
  • [6] DISTILLING KNOWLEDGE FROM ENSEMBLES OF ACOUSTIC MODELS FOR JOINT CTC-ATTENTION END-TO-END SPEECH RECOGNITION
    Gao, Yan
    Parcollet, Titouan
    Lane, Nicholas D.
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 138 - 145
  • [7] Joint CTC/attention decoding for end-to-end speech recognition
    Hori, Takaaki
    Watanabe, Shinji
    Hershey, John R.
    [J]. PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 518 - 529
  • [8] Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
    Hori, Takaaki
    Watanabe, Shinji
    Zhang, Yu
    Chan, William
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 949 - 953
  • [9] Multi-task CTC Training with Auxiliary Feature Reconstruction for End-to-end Speech Recognition
    Kurata, Gakuto
    Audhkhasi, Kartik
    [J]. INTERSPEECH 2019, 2019, : 1636 - 1640
  • [10] End-to-End Multi-Task Learning with Attention
    Liu, Shikun
    Johns, Edward
    Davison, Andrew J.
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1871 - 1880