MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Cited by: 0

Authors:

Chang, Xuankai ^{[1]}

Guo, Pengcheng ^{[2]}

Fujita, Yuya ^{[3]}

Maekaku, Takashi ^{[3]}

Watanabe, Shinji ^{[1]}

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15217 USA

[2] Northwestern Polytech Univ, Xian 710060, Peoples R China

[3] LY Corp, Tokyo 1028282, Japan

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Keywords:

Automatic speech recognition; deep learning; distant speech processing

DOI:

10.1109/LSP.2024.3449218

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]

Discipline classification codes:

0808; 0809

Abstract:

Distant Automatic Speech Recognition (DASR) stands as a crucial aspect in the realm of speech and audio processing. Recent advancements have spotlighted the efficacy of pre-trained speech foundation models, exemplified by Whisper, garnering considerable attention in the speech-processing domain. These models, trained on hundreds of thousands of hours of speech data, exhibit notable strengths in performance and generalization across various zero-shot scenarios. However, a limitation arises from their exclusive handling of single-channel input due to challenges in accumulating extensive multi-channel speech data. The spatial information in the multi-channel input is important for the DASR task. This study introduces an innovation by enabling the incorporation of multi-channel (MC) signals into the pre-trained Whisper model, called MC-Whisper. The proposed model introduces a multi-channel speech processing branch as a sidecar, to maximize the utilization of the foundation model's ability to handle multi-channel input. Experimental results on the distant microphone speech recordings from AMI meeting corpus demonstrate substantial improvements through the proposed approach.


Pages: 2850 - 2854

Page count: 5
