Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Cited: 6
Authors
Lohrenz, Timo [1 ]
Li, Zhengyang [1 ]
Fingscheidt, Tim [1 ]
Affiliations
[1] Tech Univ Carolo Wilhelmina Braunschweig, Inst Commun Technol, Schleinitzstr 22, D-38106 Braunschweig, Germany
Source
INTERSPEECH 2021
Keywords
End-to-end speech recognition; information fusion; multi-encoder learning; transformer; phase features
DOI
10.21437/Interspeech.2021-555
CLC numbers
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet it remains largely unexplored for modern end-to-end deep neural network architectures. Here, we investigate various fusion techniques for the all-attention encoder-decoder architecture known as the transformer, striving for optimal fusion by examining different fusion levels in an example single-microphone setting with fusion of standard magnitude and phase features. We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training. Employing only the magnitude feature encoder in inference, we then show consistent improvements on Wall Street Journal (WSJ) with a language model and on LibriSpeech, without any increase in runtime or parameter count. Combining two such multi-encoder-trained models by a simple late fusion in inference, we achieve state-of-the-art performance for transformer-based models on WSJ with a significant relative word error rate (WER) reduction of 19% compared to the current benchmark approach.
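The core idea of the abstract, a weighted combination of two encoder-decoder attention outputs that is active only during training, can be illustrated with a minimal single-head NumPy sketch. This is a simplification under stated assumptions: a fixed convex weight `lam` between the two streams, identity projections, and a single attention head; the paper's actual multi-head setup, learned projections, and weighting scheme are not reproduced here.

```python
import numpy as np

def cross_attention(queries, keys, values):
    # Scaled dot-product attention (single head, no mask, no projections).
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def multi_encoder_attention(queries, enc_mag, enc_phase, lam=0.5):
    """Convex combination of two encoder-decoder attention outputs.

    During training, both the magnitude- and phase-feature encoder
    outputs contribute (0 < lam < 1); at inference, setting lam = 1.0
    keeps only the magnitude-feature encoder, so no extra runtime or
    parameters are needed compared to a single-encoder model.
    """
    att_mag = cross_attention(queries, enc_mag, enc_mag)
    att_phase = cross_attention(queries, enc_phase, enc_phase)
    return lam * att_mag + (1.0 - lam) * att_phase

# Toy example: 3 decoder positions, two encoders with 4 frames, model dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((3, 8))
e_mag = rng.standard_normal((4, 8))
e_phase = rng.standard_normal((4, 8))

out_train = multi_encoder_attention(q, e_mag, e_phase, lam=0.5)  # both streams
out_infer = multi_encoder_attention(q, e_mag, e_phase, lam=1.0)  # magnitude only
```

With `lam = 1.0` the combination collapses exactly to magnitude-only cross-attention, which is why discarding the phase encoder at inference costs nothing in runtime or parameters.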
Pages: 2846-2850
Page count: 5
Related papers
50 items total
  • [21] Spectrograms Fusion-based End-to-end Robust Automatic Speech Recognition
    Shi, Hao; Wang, Longbiao; Li, Sheng; Fang, Cunhang; Dang, Jianwu; Kawahara, Tatsuya
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021: 438-442
  • [22] End-to-End Speech Recognition Technology Based on Multi-Stream CNN
    Xiao, Hao; Qiu, Yuan; Fei, Rong; Chen, Xiongbo; Liu, Zuo; Wu, Zongling
    2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS (TRUSTCOM), 2022: 1310-1315
  • [23] Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention
    Liang, Chengdong; Xu, Menglong; Zhang, Xiao-Lei
    INTERSPEECH 2021, 2021, 2: 1495-1499
  • [24] Transformer-Based End-to-End Speech Translation With Rotary Position Embedding
    Li, Xueqing; Li, Shengqiang; Zhang, Xiao-Lei; Rahardja, Susanto
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31: 371-375
  • [25] End-to-End Automatic Speech Recognition with Deep Mutual Learning
    Masumura, Ryo; Ihori, Mana; Takashima, Akihiko; Tanaka, Tomohiro; Ashihara, Takanori
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020: 632-637
  • [26] Continual Learning for Monolingual End-to-End Automatic Speech Recognition
    Vander Eeckt, Steven; Van Hamme, Hugo
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022: 459-463
  • [27] End-to-end information fusion method for transformer-based stereo matching
    Xu, Zhenghui; Wang, Jingxue; Guo, Jun
    MEASUREMENT SCIENCE AND TECHNOLOGY, 2024, 35 (06)
  • [28] Semantic Mask for Transformer based End-to-End Speech Recognition
    Wang, Chengyi; Wu, Yu; Du, Yujiao; Li, Jinyu; Liu, Shujie; Lu, Liang; Ren, Shuo; Ye, Guoli; Zhao, Sheng; Zhou, Ming
    INTERSPEECH 2020, 2020: 971-975
  • [29] Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration
    Karita, Shigeki; Soplin, Nelson Enrique Yalta; Watanabe, Shinji; Delcroix, Marc; Ogawa, Atsunori; Nakatani, Tomohiro
    INTERSPEECH 2019, 2019: 1408-1412
  • [30] STREAM ATTENTION-BASED MULTI-ARRAY END-TO-END SPEECH RECOGNITION
    Wang, Xiaofei; Li, Ruizhi; Mallidi, Sri Harish; Hori, Takaaki; Watanabe, Shinji; Hermansky, Hynek
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 7105-7109