Towards Efficiently Learning Monotonic Alignments for Attention-Based End-to-End Speech Recognition

Authors
Miao, Chenfeng [1 ]
Zou, Kun [1 ]
Zhuang, Ziyang [1 ]
Wei, Tao [1 ]
Ma, Jun [1 ]
Wang, Shaojun [1 ]
Xiao, Jing [1 ]
Affiliation
[1] Ping An Technology, Shenzhen, People's Republic of China
Source
Keywords
E2E speech recognition; monotonic alignment; attention model;
DOI
10.21437/Interspeech.2022-11259
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Inspired by EfficientTTS [1], a recently proposed speech synthesis model, we propose a new way to train attention-based end-to-end speech recognition models with an additional training objective that allows the models to learn monotonic alignments effectively and efficiently. The introduced training objective is differentiable, computationally cheap and, most importantly, imposes no constraints on network structure. It can therefore be incorporated conveniently into many speech recognition models. Through extensive experiments on a CTC/Attention architecture with Conformer blocks, we observed that our models significantly outperform the baseline models. Specifically, our best-performing model achieves a WER (Word Error Rate) of 3.18% on the LibriSpeech test-clean benchmark and 8.41% on test-other. Compared with a strong baseline obtained with WeNet, the proposed model achieves a 7.6% relative WER reduction on test-clean and 6.9% on test-other.
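The relative reductions quoted in the abstract can be related to absolute WER with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the baseline values it prints are back-calculated from the rounded figures in the abstract, not taken from the paper, so the true baseline numbers may differ slightly.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction as a fraction: (baseline - model) / baseline."""
    return (baseline_wer - model_wer) / baseline_wer


def implied_baseline_wer(model_wer: float, relative_reduction: float) -> float:
    """Baseline WER implied by a model WER and a relative reduction (a fraction)."""
    return model_wer / (1.0 - relative_reduction)


# Figures from the abstract: model WER (%) and relative reduction (fraction).
clean_baseline = implied_baseline_wer(3.18, 0.076)
other_baseline = implied_baseline_wer(8.41, 0.069)

# Back-calculated (assumed) baselines, subject to rounding in the abstract.
print(f"test-clean baseline ~ {clean_baseline:.2f}%")  # ~ 3.44%
print(f"test-other baseline ~ {other_baseline:.2f}%")  # ~ 9.03%
```

Read this way, the reported gains correspond to absolute improvements of roughly 0.26 and 0.62 WER points on test-clean and test-other respectively, under the same rounding caveat.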
Pages: 1051-1055
Page count: 5