MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Cited by: 0
Authors
Chang, Xuankai [1 ]
Guo, Pengcheng [2 ]
Fujita, Yuya [3 ]
Maekaku, Takashi [3 ]
Watanabe, Shinji [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15217 USA
[2] Northwestern Polytech Univ, Xian 710060, Peoples R China
[3] LY Corp, Tokyo 1028282, Japan
Keywords
Automatic speech recognition; deep learning; distant speech processing
DOI
10.1109/LSP.2024.3449218
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject classification codes
0808; 0809
Abstract
Distant Automatic Speech Recognition (DASR) is a crucial task in speech and audio processing. Recent work has highlighted the efficacy of pre-trained speech foundation models, exemplified by Whisper, which have garnered considerable attention in the speech-processing domain. These models, trained on hundreds of thousands of hours of speech data, exhibit strong performance and generalization across various zero-shot scenarios. However, they handle only single-channel input, owing to the difficulty of accumulating large-scale multi-channel speech data, even though the spatial information in multi-channel input is important for DASR. This study enables the pre-trained Whisper model to incorporate multi-channel (MC) signals, yielding MC-Whisper. The proposed model attaches a multi-channel speech processing branch to the foundation model as a sidecar, so that multi-channel input can be handled while fully exploiting the foundation model's capabilities. Experimental results on distant-microphone recordings from the AMI meeting corpus demonstrate substantial improvements with the proposed approach.
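The abstract describes the architecture only at a high level (a trainable multi-channel "sidecar" branch attached to a frozen single-channel foundation model). The PyTorch snippet below is a minimal sketch of that idea, not the paper's actual design: the class names (MultiChannelSidecar, MCWhisperLikeEncoder), the cross-channel attention fusion, the residual addition, and all dimensions are illustrative assumptions, and a stand-in module replaces the real frozen Whisper encoder (whose 2x temporal downsampling and feature alignment are glossed over here).

```python
# Minimal sketch (assumed design, not the authors' implementation): a trainable
# multi-channel "sidecar" branch whose output is added as a residual to the
# output of a frozen single-channel encoder.
import torch
import torch.nn as nn


class MultiChannelSidecar(nn.Module):
    """Fuses C-channel features into a residual matching the encoder width."""

    def __init__(self, n_mels: int = 80, d_model: int = 512, n_heads: int = 4):
        super().__init__()
        # Per-channel projection of log-mel features to the encoder width.
        self.channel_proj = nn.Linear(n_mels, d_model)
        # Cross-channel fusion: attend over the channel axis at each frame
        # (one assumed way to exploit spatial cues from the array).
        self.channel_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, mc_feats: torch.Tensor) -> torch.Tensor:
        # mc_feats: (batch, channels, frames, n_mels)
        b, c, t, _ = mc_feats.shape
        x = self.channel_proj(mc_feats)                  # (b, c, t, d)
        x = x.permute(0, 2, 1, 3).reshape(b * t, c, -1)  # treat channels as a sequence
        fused, _ = self.channel_attn(x, x, x)            # fuse cues across channels
        fused = fused.mean(dim=1).reshape(b, t, -1)      # pool channels -> (b, t, d)
        return self.out_proj(fused)


class MCWhisperLikeEncoder(nn.Module):
    """Frozen single-channel encoder + trainable multi-channel sidecar."""

    def __init__(self, sc_encoder: nn.Module, d_model: int = 512):
        super().__init__()
        self.sc_encoder = sc_encoder
        for p in self.sc_encoder.parameters():           # keep the foundation model frozen
            p.requires_grad_(False)
        self.sidecar = MultiChannelSidecar(d_model=d_model)

    def forward(self, sc_feats: torch.Tensor, mc_feats: torch.Tensor) -> torch.Tensor:
        # sc_feats: (batch, frames, n_mels) from a reference channel;
        # mc_feats: (batch, channels, frames, n_mels) from the distant array.
        h = self.sc_encoder(sc_feats)                    # (batch, frames, d_model)
        return h + self.sidecar(mc_feats)                # add spatial cues as a residual


# Illustrative usage with a placeholder encoder; a real setup would wrap the
# pre-trained Whisper encoder and align frame rates accordingly.
if __name__ == "__main__":
    enc = nn.Linear(80, 512)                             # stand-in for the frozen encoder
    model = MCWhisperLikeEncoder(enc)
    sc = torch.randn(2, 100, 80)
    mc = torch.randn(2, 8, 100, 80)
    print(model(sc, mc).shape)                           # torch.Size([2, 100, 512])
```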
Pages: 2850-2854
Page count: 5