INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

被引：38

作者：

Raj, Desh ^{[1
]}

Denisov, Pavel ^{[2
]}

Chen, Zhuo ^{[3
]}

Erdogan, Hakan ^{[4
]}

Huang, Zili ^{[1
]}

He, Maokui ^{[5
,6
]}

Watanabe, Shinji ^{[1
]}

Du, Jun ^{[5
,6
]}

Yoshioka, Takuya ^{[3
]}

Luo, Yi

Kanda, Naoyuki ^{[3
]}

Li, Jinyu ^{[3
]}

Wisdom, Scott ^{[4
]}

Hershey, John R. ^{[4
]}

机构：

[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany

[3] Microsoft Corp, Redmond, WA 98052 USA

[4] Google Res, Cambridge, MA USA

[5] Univ Sci & Technol China, Hefei, Peoples R China

[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA

来源：

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021年

关键词：

Speech separation; diarization; speech recognition; multi-speaker;

D O I：

10.1109/SLT48900.2021.9383556

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.

引用

页码：897 / 904

页数：8

共 50 条

[11] End-to-End Multilingual Multi-Speaker Speech Recognition
Seki, Hiroshi
Hori, Takaaki
Watanabe, Shinji
Le Roux, Jonathan
Hershey, John R.
INTERSPEECH 2019, 2019, : 3755 - 3759
[12] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
Chang, Xuankai
Zhang, Wangyou
Qian, Yanmin
Le Roux, Jonathan
Watanabe, Shinji
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
[13] Integration of audio-visual information for multi-speaker multimedia speaker recognition
Yang, Jichen
Chen, Fangfan
Cheng, Yu
Lin, Pei
DIGITAL SIGNAL PROCESSING, 2024, 145
[14] Revolutionizing Speaker Recognition and Diarization: A Novel Methodology in Speech Analysis
Ravi D. Shankar
R. B. Manjula
Rajashekhar C. Biradar
SN Computer Science, 6 (1)
[15] SPEAKER CONDITIONING OF ACOUSTIC MODELS USING AFFINE TRANSFORMATION FOR MULTI-SPEAKER SPEECH RECOGNITION
Yousefi, Midia
Hansen, John H. L.
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 283 - 288
[16] SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION
Soleymanpour, Mohammad
Johnson, Michael T.
Soleymanpour, Rahim
Berry, Jeffrey
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7382 - 7386
[17] A unified network for multi-speaker speech recognition with multi-channel recordings
Liu, Conggui
Inoue, Nakamasa
Shinoda, Koichi
2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC 2017), 2017, : 1304 - 1307
[18] Multi-Speaker Adaptation for Robust Speech Recognition under Ubiquitous Environment
Shih, Po-Yi
Wang, Jhing-Fa
Lin, Yuan-Ning
Fu, Zhong-Hua
ORIENTAL COCOSDA 2009 - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2009, : 126 - 131
[19] Automatic Multi-Speaker Speech Recognition System Based on Time-Frequency Blind Source Separation under Ubiquitous Environment
Wang, Zhe
Zhang, Haijian
Bi, Guoan
Li, Xiumei
PROCEEDINGS OF THE 2014 9TH IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA), 2014, : 101 - +
[20] MIMO-SPEECH: END-TO-END MULTI-CHANNEL MULTI-SPEAKER SPEECH RECOGNITION
Chang, Xuankai
Zhang, Wangyou
Qian, Yanmin
Le Roux, Jonathan
Watanabe, Shinji
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 237 - 244

← 1 2 3 4 5 →