MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Cited by: 0

Authors:

Chang, Xuankai [1]

Guo, Pengcheng [2]

Fujita, Yuya [3]

Maekaku, Takashi [3]

Watanabe, Shinji [1]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15217 USA

[2] Northwestern Polytech Univ, Xian 710060, Peoples R China

[3] LY Corp, Tokyo 1028282, Japan

Source:

IEEE SIGNAL PROCESSING LETTERS | 2024 / Vol. 31

Keywords:

Automatic speech recognition; deep learning; distant speech processing;

DOI:

10.1109/LSP.2024.3449218

Chinese Library Classification (CLC) number:

TM [Electrical Engineering]; TN [Electronics and Communication Technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

Distant Automatic Speech Recognition (DASR) stands as a crucial aspect in the realm of speech and audio processing. Recent advancements have spotlighted the efficacy of pre-trained speech foundation models, exemplified by Whisper, garnering considerable attention in the speech-processing domain. These models, trained on hundreds of thousands of hours of speech data, exhibit notable strengths in performance and generalization across various zero-shot scenarios. However, a limitation arises from their exclusive handling of single-channel input due to challenges in accumulating extensive multi-channel speech data. The spatial information in the multi-channel input is important for the DASR task. This study introduces an innovation by enabling the incorporation of multi-channel (MC) signals into the pre-trained Whisper model, called MC-Whisper. The proposed model introduces a multi-channel speech processing branch as a sidecar, to maximize the utilization of the foundation model's ability to handle multi-channel input. Experimental results on the distant microphone speech recordings from AMI meeting corpus demonstrate substantial improvements through the proposed approach.


Pages: 2850-2854

Page count: 5
