MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Cited by: 0

Authors:

Chang, Xuankai [1]

Guo, Pengcheng [2]

Fujita, Yuya [3]

Maekaku, Takashi [3]

Watanabe, Shinji [1]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15217 USA

[2] Northwestern Polytech Univ, Xian 710060, Peoples R China

[3] LY Corp, Tokyo 1028282, Japan

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Keywords:

Automatic speech recognition; deep learning; distant speech processing

DOI:

10.1109/LSP.2024.3449218

Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification codes:

0808; 0809

Abstract:

Distant Automatic Speech Recognition (DASR) stands as a crucial aspect in the realm of speech and audio processing. Recent advancements have spotlighted the efficacy of pre-trained speech foundation models, exemplified by Whisper, garnering considerable attention in the speech-processing domain. These models, trained on hundreds of thousands of hours of speech data, exhibit notable strengths in performance and generalization across various zero-shot scenarios. However, a limitation arises from their exclusive handling of single-channel input due to challenges in accumulating extensive multi-channel speech data. The spatial information in the multi-channel input is important for the DASR task. This study introduces an innovation by enabling the incorporation of multi-channel (MC) signals into the pre-trained Whisper model, called MC-Whisper. The proposed model introduces a multi-channel speech processing branch as a sidecar, to maximize the utilization of the foundation model's ability to handle multi-channel input. Experimental results on the distant microphone speech recordings from the AMI meeting corpus demonstrate substantial improvements through the proposed approach.


Pages: 2850-2854

Page count: 5
