Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Cited by: 6
Authors
Lohrenz, Timo [1 ]
Li, Zhengyang [1 ]
Fingscheidt, Tim [1 ]
Affiliations
[1] Tech Univ Carolo Wilhelmina Braunschweig, Inst Commun Technol, Schleinitzstr 22, D-38106 Braunschweig, Germany
Source
INTERSPEECH 2021
Keywords
End-to-end speech recognition; information fusion; multi-encoder learning; transformer; phase features
DOI
10.21437/Interspeech.2021-555
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet it remains mostly unexplored for modern deep neural network end-to-end model architectures. Here, we investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer, striving for optimal fusion by examining different fusion levels in an example single-microphone setting that fuses standard magnitude and phase features. We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs during training only. Employing only the magnitude-feature encoder during inference, we show consistent improvements on the Wall Street Journal (WSJ) corpus with a language model and on LibriSpeech, without any increase in runtime or parameter count. Combining two such multi-encoder-trained models by a simple late fusion in inference, we achieve state-of-the-art performance for transformer-based models on WSJ, with a significant relative WER reduction of 19% compared to the current benchmark approach.
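The core training-time mechanism described in the abstract can be sketched in a few lines: the decoder's cross-attention is computed against two encoder outputs (magnitude and phase features), and the two context vectors are mixed with a weight λ during training, while inference uses the magnitude-feature encoder alone. The following minimal NumPy sketch is illustrative only; the single-head simplification, function names, and the fixed weight `lam` are assumptions, not the authors' implementation.

```python
import numpy as np

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention (illustrative).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def fused_context(q, enc_mag, enc_phase, lam=0.5, training=True):
    # Multi-encoder learning: during training, the two encoder-decoder
    # attention outputs are combined with weight lam (an assumed fixed
    # scalar here). At inference, only the magnitude-feature encoder
    # is used, so runtime and parameter count do not grow.
    c_mag = cross_attention(q, enc_mag, enc_mag)
    if not training:
        return c_mag
    c_phase = cross_attention(q, enc_phase, enc_phase)
    return lam * c_mag + (1.0 - lam) * c_phase
```

Setting `training=False` makes the phase branch vanish entirely, which is why the method adds no inference cost; the late-fusion variant in the abstract would instead combine the outputs of two such independently trained models at decoding time.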
Pages: 2846-2850 (5 pages)
Related papers (50 entries)
  • [1] A Transformer-Based End-to-End Automatic Speech Recognition Algorithm
    Dong, Fang
    Qian, Yiyang
    Wang, Tianlei
    Liu, Peng
    Cao, Jiuwen
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 1592 - 1596
  • [2] An End-to-End Transformer-Based Automatic Speech Recognition for Qur'an Reciters
    Hadwan, Mohammed
    Alsayadi, Hamzah A.
    AL-Hagree, Salah
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 74 (02): : 3471 - 3487
  • [3] Transformer-based Long-context End-to-end Speech Recognition
    Hori, Takaaki
    Moritz, Niko
    Hori, Chiori
    Le Roux, Jonathan
    [J]. INTERSPEECH 2020, 2020, : 5011 - 5015
  • [4] On-device Streaming Transformer-based End-to-End Speech Recognition
    Oh, Yoo Rhee
    Park, Kiyoung
    [J]. INTERSPEECH 2021, 2021, : 967 - 968
  • [5] An Investigation of Positional Encoding in Transformer-based End-to-end Speech Recognition
    Yue, Fengpeng
    Ko, Tom
    [J]. 2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [6] Fast offline transformer-based end-to-end automatic speech recognition for real-world applications
    Oh, Yoo Rhee
    Park, Kiyoung
    Park, Jeon Gue
    [J]. ETRI JOURNAL, 2022, 44 (03) : 476 - 490
  • [7] TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION WITH LOCAL DENSE SYNTHESIZER ATTENTION
    Xu, Menglong
    Li, Shengqiang
    Zhang, Xiao-Lei
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5899 - 5903
  • [8] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION
    Luo, Haoneng
    Zhang, Shiliang
    Lei, Ming
    Xie, Lei
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 75 - 81
  • [9] A study of transformer-based end-to-end speech recognition system for Kazakh language
    Mamyrbayev, Orken
    Oralbekova, Dina
    Alimhan, Keylan
    Turdalykyzy, Tolganay
    Othman, Mohamed
    [J]. SCIENTIFIC REPORTS, 2022, 12 (01)