INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

被引:30
|
作者
Raj, Desh [1 ]
Denisov, Pavel [2 ]
Chen, Zhuo [3 ]
Erdogan, Hakan [4 ]
Huang, Zili [1 ]
He, Maokui [5 ,6 ]
Watanabe, Shinji [1 ]
Du, Jun [5 ,6 ]
Yoshioka, Takuya [3 ]
Luo, Yi
Kanda, Naoyuki [3 ]
Li, Jinyu [3 ]
Wisdom, Scott [4 ]
Hershey, John R. [4 ]
机构
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
[3] Microsoft Corp, Redmond, WA 98052 USA
[4] Google Res, Cambridge, MA USA
[5] Univ Sci & Technol China, Hefei, Peoples R China
[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA
关键词
Speech separation; diarization; speech recognition; multi-speaker;
D O I
10.1109/SLT48900.2021.9383556
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.
引用
下载
收藏
页码:897 / 904
页数:8
相关论文
共 50 条
  • [1] Speech Recognition and Multi-Speaker Diarization of Long Conversations
    Mao, Huanru Henry
    Li, Shuyang
    McAuley, Julian
    Cottrell, Garrison W.
    INTERSPEECH 2020, 2020, : 691 - 695
  • [2] MULTI-SPEAKER CONVERSATIONS, CROSS-TALK, AND DIARIZATION FOR SPEAKER RECOGNITION
    Sell, Gregory
    McCree, Alan
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5425 - 5429
  • [3] Sparse Component Analysis for Speech Recognition in Multi-Speaker Environment
    Asaei, Afsaneh
    Bourlard, Herve
    Garner, Philip N.
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 1704 - 1707
  • [4] A Purely End-to-end System for Multi-speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 2620 - 2630
  • [5] Study on Integration of Speaker Diarization with Speaker Adaptive Speech Recognition for Broadcast Transcription
    Silovsky, Jan
    Cerva, Petr
    Zdansky, Jindrich
    Nouza, Jan
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 478 - 481
  • [6] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION
    Settle, Shane
    Le Roux, Jonathan
    Hori, Takaaki
    Watanabe, Shinji
    Hershey, John R.
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4819 - 4823
  • [7] Advances in multi-speaker conversational speech recognition and understanding
    Hori, Takaaki
    Araki, Shoko
    Nakatani, Tomohiro O.
    Nakamura, Atsushi
    NTT Technical Review, 2013, 11 (12):
  • [8] The SAIL Speaker Diarization System for Analysis of Spontaneous Meetings
    Han, Kyu J.
    Georgiou, Panayiotis G.
    Narayanan, Shrikanth S.
    2008 IEEE 10TH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, VOLS 1 AND 2, 2008, : 970 - 975
  • [9] Fast ICA for Multi-speaker Recognition System
    Zhou, Yan
    Zhao, Zhiqiang
    ADVANCED INTELLIGENT COMPUTING THEORIES AND APPLICATIONS, 2010, 93 : 507 - 513
  • [10] End-to-End Multilingual Multi-Speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    INTERSPEECH 2019, 2019, : 3755 - 3759