INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

Cited by: 30
Authors
Raj, Desh [1]
Denisov, Pavel [2]
Chen, Zhuo [3]
Erdogan, Hakan [4]
Huang, Zili [1]
He, Maokui [5, 6]
Watanabe, Shinji [1]
Du, Jun [5, 6]
Yoshioka, Takuya [3]
Luo, Yi
Kanda, Naoyuki [3]
Li, Jinyu [3]
Wisdom, Scott [4]
Hershey, John R. [4]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
[3] Microsoft Corp, Redmond, WA 98052 USA
[4] Google Res, Cambridge, MA USA
[5] Univ Sci & Technol China, Hefei, Peoples R China
[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA
Keywords
Speech separation; diarization; speech recognition; multi-speaker
DOI
10.1109/SLT48900.2021.9383556
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
Pages: 897-904
Number of pages: 8