MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Authors
Chang, Xuankai [1]
Guo, Pengcheng [2]
Fujita, Yuya [3]
Maekaku, Takashi [3]
Watanabe, Shinji [1]
Affiliations
[1] Carnegie Mellon University, Pittsburgh, PA 15217, USA
[2] Northwestern Polytechnical University, Xi'an 710060, People's Republic of China
[3] LY Corporation, Tokyo 1028282, Japan
Keywords
Automatic speech recognition; deep learning; distant speech processing
DOI
10.1109/LSP.2024.3449218
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic and communication technology]
Subject classification codes
0808; 0809
Abstract
Distant automatic speech recognition (DASR) is a crucial task in speech and audio processing. Pre-trained speech foundation models, exemplified by Whisper, have recently drawn considerable attention in the speech-processing community for their effectiveness. These models, trained on hundreds of thousands of hours of speech data, show strong performance and generalization across various zero-shot scenarios. However, they accept only single-channel input, because large amounts of multi-channel speech data are difficult to collect, even though the spatial information carried by multi-channel input is important for DASR. This study enables the pre-trained Whisper model to incorporate multi-channel (MC) signals; the resulting model is called MC-Whisper. The proposed model adds a multi-channel speech processing branch as a sidecar, so that the foundation model's capabilities are used as fully as possible when handling multi-channel input. Experiments on the distant-microphone recordings of the AMI meeting corpus show substantial improvements with the proposed approach.
Pages: 2850-2854
Number of pages: 5