MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Authors
Chang, Xuankai [1]
Guo, Pengcheng [2]
Fujita, Yuya [3]
Maekaku, Takashi [3]
Watanabe, Shinji [1]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15217 USA
[2] Northwestern Polytech Univ, Xian 710060, Peoples R China
[3] LY Corp, Tokyo 1028282, Japan
Keywords
Automatic speech recognition; deep learning; distant speech processing
DOI
10.1109/LSP.2024.3449218
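Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809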
Abstract
Distant Automatic Speech Recognition (DASR) stands as a crucial aspect in the realm of speech and audio processing. Recent advancements have spotlighted the efficacy of pre-trained speech foundation models, exemplified by Whisper, garnering considerable attention in the speech-processing domain. These models, trained on hundreds of thousands of hours of speech data, exhibit notable strengths in performance and generalization across various zero-shot scenarios. However, a limitation arises from their exclusive handling of single-channel input due to challenges in accumulating extensive multi-channel speech data. The spatial information in the multi-channel input is important for the DASR task. This study introduces an innovation by enabling the incorporation of multi-channel (MC) signals into the pre-trained Whisper model, called MC-Whisper. The proposed model introduces a multi-channel speech processing branch as a sidecar, to maximize the utilization of the foundation model's ability to handle multi-channel input. Experimental results on the distant microphone speech recordings from the AMI meeting corpus demonstrate substantial improvements through the proposed approach.
Pages: 2850-2854
Number of pages: 5