SIMULTANEOUS SPEECH RECOGNITION AND SPEAKER DIARIZATION FOR MONAURAL DIALOGUE RECORDINGS WITH TARGET-SPEAKER ACOUSTIC MODELS

Cited by: 0

Authors:

Kanda, Naoyuki [1]

Horiguchi, Shota [1]

Fujita, Yusuke [1]

Xue, Yawen [1]

Nagamatsu, Kenji [1]

Watanabe, Shinji [2]

Affiliations:

[1] Hitachi Ltd, Hitachi, Ibaraki, Japan

[2] Johns Hopkins Univ, Baltimore, MD 21218 USA

Source:

2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019

Keywords:

multi-talker speech recognition; speaker diarization; deep learning; overlapped speech

DOI:

10.1109/asru46091.2019.9004009

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique that automatically extracts and recognizes only the speech of a target speaker, given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown, because it requires a sample of each target speaker before decoding. To remove this limitation, we propose an iterative method in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are executed alternately. We evaluated the proposed method on very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both the word error rate (WER) and the diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from that of TS-ASR given oracle speaker embeddings. Furthermore, our method solves speaker diarization simultaneously as a by-product and achieved a better DER than the conventional clustering-based speaker diarization method based on i-vectors.
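The alternation between steps (i) and (ii) in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: the record gives no model details, so every name below (`estimate_embedding`, `toy_ts_asr`, `alternate`) is hypothetical. The "embedding" is just a weighted mean of 2-D frame features standing in for an i-vector, and the "TS-ASR" step only marks which frames look like the target speaker, whereas a real system would return that speaker's words.

```python
import random


def estimate_embedding(frames, weights):
    """Step (i), toy version: weighted mean of frame features.

    Assumption: stands in for a real speaker-embedding (e.g. i-vector) extractor.
    """
    dim = len(frames[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [sum(w * f[d] for f, w in zip(frames, weights)) / total
            for d in range(dim)]


def toy_ts_asr(frames, target, others):
    """Step (ii), toy version: per-frame activity of the target speaker.

    Assumption: a frame is attributed to the target if it is at least as
    close to the target embedding as to every other speaker's embedding.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [1.0 if all(dist(f, target) <= dist(f, o) for o in others) else 0.0
            for f in frames]


def alternate(frames, init_activity, n_iters=5):
    """Alternate embedding estimation and target-speaker decoding."""
    activity = init_activity
    embeddings = []
    for _ in range(n_iters):
        # (i) re-estimate one embedding per speaker from current activity
        embeddings = [estimate_embedding(frames, a) for a in activity]
        # (ii) re-decode each speaker given the refreshed embeddings
        activity = [
            toy_ts_asr(frames, e, embeddings[:i] + embeddings[i + 1:])
            for i, e in enumerate(embeddings)
        ]
    return embeddings, activity


# Synthetic data: 20 frames near (0, 0) for speaker A, then 20 near (5, 5) for B.
rng = random.Random(0)
frames = [[rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)] for _ in range(20)]
frames += [[5 + rng.uniform(-0.5, 0.5), 5 + rng.uniform(-0.5, 0.5)] for _ in range(20)]

# Deliberately wrong starting guess: speaker 0 is given 5 of speaker B's frames.
init_activity = [
    [1.0] * 25 + [0.0] * 15,
    [0.0] * 25 + [1.0] * 15,
]

embeddings, activity = alternate(frames, init_activity)
print(activity[0])
print(activity[1])
```

In this toy setting the per-speaker activity doubles as a diarization result, which mirrors the abstract's remark that diarization comes out as a by-product of the recognition pass. The initial guess here is hand-made; how the initial embeddings are obtained in the paper is not stated in this record.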

Pages: 31-38

Page count: 8
