MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Cited by: 0
Authors
Chang, Xuankai [1 ]
Guo, Pengcheng [2 ]
Fujita, Yuya [3 ]
Maekaku, Takashi [3 ]
Watanabe, Shinji [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15217 USA
[2] Northwestern Polytech Univ, Xian 710060, Peoples R China
[3] LY Corp, Tokyo 1028282, Japan
Keywords
Automatic speech recognition; deep learning; distant speech processing
DOI
10.1109/LSP.2024.3449218
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject classification codes
0808; 0809
Abstract
Distant Automatic Speech Recognition (DASR) is a crucial task in speech and audio processing. Recent work has highlighted the efficacy of pre-trained speech foundation models, exemplified by Whisper, which have garnered considerable attention in the speech-processing domain. These models, trained on hundreds of thousands of hours of speech data, exhibit strong performance and generalization across various zero-shot scenarios. However, they handle only single-channel input, owing to the difficulty of accumulating large-scale multi-channel speech data, even though the spatial information in multi-channel input is important for DASR. This study enables the pre-trained Whisper model to incorporate multi-channel (MC) signals, yielding MC-Whisper. The proposed model attaches a multi-channel speech processing branch to the foundation model as a sidecar, so that multi-channel input can be handled while fully exploiting the foundation model's capabilities. Experimental results on distant-microphone recordings from the AMI meeting corpus demonstrate substantial improvements with the proposed approach.
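The abstract describes the architecture only at a high level (a trainable multi-channel "sidecar" branch attached to a frozen single-channel foundation model). The PyTorch snippet below is a minimal sketch of that idea, not the paper's actual design: the class names (MultiChannelSidecar, MCWhisperLikeEncoder), the cross-channel attention fusion, the residual addition, and all dimensions are illustrative assumptions, and a stand-in module replaces the real frozen Whisper encoder (whose 2x temporal downsampling and feature alignment are glossed over here).

```python
# Minimal sketch (assumed design, not the authors' implementation): a trainable
# multi-channel "sidecar" branch whose output is added as a residual to the
# output of a frozen single-channel encoder.
import torch
import torch.nn as nn


class MultiChannelSidecar(nn.Module):
    """Fuses C-channel features into a residual matching the encoder width."""

    def __init__(self, n_mels: int = 80, d_model: int = 512, n_heads: int = 4):
        super().__init__()
        # Per-channel projection of log-mel features to the encoder width.
        self.channel_proj = nn.Linear(n_mels, d_model)
        # Cross-channel fusion: attend over the channel axis at each frame
        # (one assumed way to exploit spatial cues from the array).
        self.channel_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, mc_feats: torch.Tensor) -> torch.Tensor:
        # mc_feats: (batch, channels, frames, n_mels)
        b, c, t, _ = mc_feats.shape
        x = self.channel_proj(mc_feats)                  # (b, c, t, d)
        x = x.permute(0, 2, 1, 3).reshape(b * t, c, -1)  # treat channels as a sequence
        fused, _ = self.channel_attn(x, x, x)            # fuse cues across channels
        fused = fused.mean(dim=1).reshape(b, t, -1)      # pool channels -> (b, t, d)
        return self.out_proj(fused)


class MCWhisperLikeEncoder(nn.Module):
    """Frozen single-channel encoder + trainable multi-channel sidecar."""

    def __init__(self, sc_encoder: nn.Module, d_model: int = 512):
        super().__init__()
        self.sc_encoder = sc_encoder
        for p in self.sc_encoder.parameters():           # keep the foundation model frozen
            p.requires_grad_(False)
        self.sidecar = MultiChannelSidecar(d_model=d_model)

    def forward(self, sc_feats: torch.Tensor, mc_feats: torch.Tensor) -> torch.Tensor:
        # sc_feats: (batch, frames, n_mels) from a reference channel;
        # mc_feats: (batch, channels, frames, n_mels) from the distant array.
        h = self.sc_encoder(sc_feats)                    # (batch, frames, d_model)
        return h + self.sidecar(mc_feats)                # add spatial cues as a residual


# Illustrative usage with a placeholder encoder; a real setup would wrap the
# pre-trained Whisper encoder and align frame rates accordingly.
if __name__ == "__main__":
    enc = nn.Linear(80, 512)                             # stand-in for the frozen encoder
    model = MCWhisperLikeEncoder(enc)
    sc = torch.randn(2, 100, 80)
    mc = torch.randn(2, 8, 100, 80)
    print(model(sc, mc).shape)                           # torch.Size([2, 100, 512])
```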
Pages: 2850-2854
Page count: 5