INTEGRATION OF SPEECH SEPARATION, DIARIZATION, AND RECOGNITION FOR MULTI-SPEAKER MEETINGS: SYSTEM DESCRIPTION, COMPARISON, AND ANALYSIS

Cited by: 30

Authors:

Raj, Desh [1]

Denisov, Pavel [2]

Chen, Zhuo [3]

Erdogan, Hakan [4]

Huang, Zili [1]

He, Maokui [5,6]

Watanabe, Shinji [1]

Du, Jun [5,6]

Yoshioka, Takuya [3]

Luo, Yi

Kanda, Naoyuki [3]

Li, Jinyu [3]

Wisdom, Scott [4]

Hershey, John R. [4]

Affiliations:

[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany

[3] Microsoft Corp, Redmond, WA 98052 USA

[4] Google Res, Cambridge, MA USA

[5] Univ Sci & Technol China, Hefei, Peoples R China

[6] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

Speech separation; diarization; speech recognition; multi-speaker

DOI:

10.1109/SLT48900.2021.9383556

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module. Our best system achieves a speaker-attributed WER of 12.7%, which is close to that of a non-overlapping ASR.

Pages: 897-904

Page count: 8
