Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Cited by: 6
Authors
Lohrenz, Timo [1 ]
Li, Zhengyang [1 ]
Fingscheidt, Tim [1 ]
Affiliations
[1] Tech Univ Carolo Wilhelmina Braunschweig, Inst Commun Technol, Schleinitzstr 22, D-38106 Braunschweig, Germany
Source
INTERSPEECH 2021
Keywords
End-to-end speech recognition; information fusion; multi-encoder learning; transformer; phase features
DOI
10.21437/Interspeech.2021-555
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet it remains mostly unexplored for modern deep neural network end-to-end model architectures. Here, we investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer, striving for optimal fusion by examining different fusion levels in an example single-microphone setting that fuses standard magnitude and phase features. We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs during training only. Employing only the magnitude-feature encoder during inference, we show consistent improvements on the Wall Street Journal (WSJ) corpus with a language model and on LibriSpeech, without any increase in runtime or parameter count. Combining two such multi-encoder-trained models by a simple late fusion in inference, we achieve state-of-the-art performance for transformer-based models on WSJ, with a significant relative WER reduction of 19% compared to the current benchmark approach.
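The core training-time mechanism described in the abstract can be sketched in a few lines: the decoder's cross-attention is computed against two encoder outputs (magnitude and phase features), and the two context vectors are mixed with a weight λ during training, while inference uses the magnitude-feature encoder alone. The following minimal NumPy sketch is illustrative only; the single-head simplification, function names, and the fixed weight `lam` are assumptions, not the authors' implementation.

```python
import numpy as np

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention (illustrative).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def fused_context(q, enc_mag, enc_phase, lam=0.5, training=True):
    # Multi-encoder learning: during training, the two encoder-decoder
    # attention outputs are combined with weight lam (an assumed fixed
    # scalar here). At inference, only the magnitude-feature encoder
    # is used, so runtime and parameter count do not grow.
    c_mag = cross_attention(q, enc_mag, enc_mag)
    if not training:
        return c_mag
    c_phase = cross_attention(q, enc_phase, enc_phase)
    return lam * c_mag + (1.0 - lam) * c_phase
```

Setting `training=False` makes the phase branch vanish entirely, which is why the method adds no inference cost; the late-fusion variant in the abstract would instead combine the outputs of two such independently trained models at decoding time.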
Pages: 2846-2850 (5 pages)
Related papers (50 entries)
  • [1] A Transformer-Based End-to-End Automatic Speech Recognition Algorithm
    Dong, Fang
    Qian, Yiyang
    Wang, Tianlei
    Liu, Peng
    Cao, Jiuwen
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 1592 - 1596
  • [2] An End-to-End Transformer-Based Automatic Speech Recognition for Qur'an Reciters
    Hadwan, Mohammed
    Alsayadi, Hamzah A.
    AL-Hagree, Salah
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 74 (02): : 3471 - 3487
  • [3] Transformer-based Long-context End-to-end Speech Recognition
    Hori, Takaaki
    Moritz, Niko
    Hori, Chiori
    Le Roux, Jonathan
    [J]. INTERSPEECH 2020, 2020, : 5011 - 5015
  • [4] On-device Streaming Transformer-based End-to-End Speech Recognition
    Oh, Yoo Rhee
    Park, Kiyoung
    [J]. INTERSPEECH 2021, 2021, : 967 - 968
  • [5] An Investigation of Positional Encoding in Transformer-based End-to-end Speech Recognition
    Yue, Fengpeng
    Ko, Tom
    [J]. 2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [6] Fast offline transformer-based end-to-end automatic speech recognition for real-world applications
    Oh, Yoo Rhee
    Park, Kiyoung
    Park, Jeon Gue
    [J]. ETRI JOURNAL, 2022, 44 (03) : 476 - 490
  • [7] TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION WITH LOCAL DENSE SYNTHESIZER ATTENTION
    Xu, Menglong
    Li, Shengqiang
    Zhang, Xiao-Lei
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5899 - 5903
  • [8] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
    Luo, Haoneng
    Zhang, Shiliang
    Lei, Ming
    Xie, Lei
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
  • [9] A study of transformer-based end-to-end speech recognition system for Kazakh language
    Mamyrbayev, Orken
    Oralbekova, Dina
    Alimhan, Keylan
    Turdalykyzy, Tolganay
    Othman, Mohamed
    [J]. SCIENTIFIC REPORTS, 2022, 12 (01)