MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Authors
Chang, Xuankai [1]
Guo, Pengcheng [2]
Fujita, Yuya [3]
Maekaku, Takashi [3]
Watanabe, Shinji [1]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15217 USA
[2] Northwestern Polytech Univ, Xian 710060, Peoples R China
[3] LY Corp, Tokyo 1028282, Japan
Keywords
Automatic speech recognition; deep learning; distant speech processing;
DOI
10.1109/LSP.2024.3449218
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Distant Automatic Speech Recognition (DASR) stands as a crucial aspect in the realm of speech and audio processing. Recent advancements have spotlighted the efficacy of pre-trained speech foundation models, exemplified by Whisper, garnering considerable attention in the speech-processing domain. These models, trained on hundreds of thousands of hours of speech data, exhibit notable strengths in performance and generalization across various zero-shot scenarios. However, a limitation arises from their exclusive handling of single-channel input due to challenges in accumulating extensive multi-channel speech data. The spatial information in the multi-channel input is important for the DASR task. This study introduces an innovation by enabling the incorporation of multi-channel (MC) signals into the pre-trained Whisper model, called MC-Whisper. The proposed model introduces a multi-channel speech processing branch as a sidecar, to maximize the utilization of the foundation model's ability to handle multi-channel input. Experimental results on the distant microphone speech recordings from the AMI meeting corpus demonstrate substantial improvements through the proposed approach.
Pages: 2850-2854
Page count: 5