SIMULTANEOUS SPEECH RECOGNITION AND SPEAKER DIARIZATION FOR MONAURAL DIALOGUE RECORDINGS WITH TARGET-SPEAKER ACOUSTIC MODELS

Authors
Kanda, Naoyuki [1]
Horiguchi, Shota [1]
Fujita, Yusuke [1]
Xue, Yawen [1]
Nagamatsu, Kenji [1]
Watanabe, Shinji [2]
Affiliations
[1] Hitachi Ltd, Hitachi, Ibaraki, Japan
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
multi-talker speech recognition; speaker diarization; deep learning; overlapped speech
DOI
10.1109/asru46091.2019.9004009
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique that automatically extracts and recognizes only the speech of a target speaker, given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown, because it requires a sample of each target speaker before decoding. To remove this limitation, we propose an iterative method in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are executed alternately. We evaluated the proposed method on very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both the word error rate (WER) and the diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from that of TS-ASR given oracle speaker embeddings. Furthermore, our method solves speaker diarization simultaneously as a by-product and achieved a better DER than the conventional clustering-based speaker diarization method based on i-vectors.
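The alternation between steps (i) and (ii) in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the loop is initialized, how embeddings are re-estimated from TS-ASR output, or when it stops, so the sketch assumes a rough initial segmentation per speaker, re-estimation from the time spans of the recognized speech, and a stop once the segmentation no longer changes. All names (alternate_embedding_and_ts_asr, toy_estimate_embedding, toy_ts_asr) are hypothetical, and the two toy functions are trivial stand-ins (a scalar mean and a distance threshold) for a real embedding extractor such as an i-vector extractor and a real TS-ASR decoder.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[int, int]               # (start_frame, end_frame), end exclusive
Hypothesis = List[Tuple[Segment, str]]  # (time span, recognized token)


def alternate_embedding_and_ts_asr(
    audio: Sequence[float],
    initial_segments: Dict[str, List[Segment]],
    estimate_embedding: Callable[[Sequence[float], List[Segment]], float],
    ts_asr: Callable[[Sequence[float], float], Hypothesis],
    max_iterations: int = 5,
) -> Tuple[Dict[str, Hypothesis], Dict[str, List[Segment]]]:
    """Alternate (i) embedding estimation and (ii) target-speaker decoding."""
    segments = {spk: list(segs) for spk, segs in initial_segments.items()}
    hyps: Dict[str, Hypothesis] = {}
    for _ in range(max_iterations):
        # (i) Estimate one embedding per speaker from the frames currently
        #     attributed to that speaker.
        embeddings = {s: estimate_embedding(audio, segs) for s, segs in segments.items()}
        # (ii) Decode the whole recording once per speaker, conditioned on
        #      that speaker's embedding.
        hyps = {s: ts_asr(audio, emb) for s, emb in embeddings.items()}
        # Assumption: the time spans of the recognized speech become the new
        # speaker-attributed segments (the diarization by-product). An empty
        # hypothesis keeps the previous segments.
        new_segments = {s: [seg for seg, _ in hyp] or segments[s] for s, hyp in hyps.items()}
        if new_segments == segments:  # assumed stopping rule: nothing changed
            break
        segments = new_segments
    return hyps, segments


def toy_estimate_embedding(audio: Sequence[float], segs: List[Segment]) -> float:
    # Stand-in for a real extractor: mean of a 1-D per-frame feature.
    return mean(audio[t] for start, end in segs for t in range(start, end))


def toy_ts_asr(audio: Sequence[float], embedding: float, radius: float = 1.0) -> Hypothesis:
    # Stand-in for TS-ASR: emit a "<speech>" token for each contiguous run of
    # frames whose feature lies within `radius` of the target embedding.
    hyp: Hypothesis = []
    start = None
    for t, x in enumerate(audio):
        active = abs(x - embedding) < radius
        if active and start is None:
            start = t
        elif not active and start is not None:
            hyp.append(((start, t), "<speech>"))
            start = None
    if start is not None:
        hyp.append(((start, len(audio)), "<speech>"))
    return hyp


# Toy recording: speaker "A" has features near +1, speaker "B" near -1.
audio = [1.0, 0.9, -1.0, -0.8, 1.1, -0.9]
rough_start = {"A": [(0, 3)], "B": [(3, 6)]}  # deliberately inaccurate
hyps, segments = alternate_embedding_and_ts_asr(
    audio, rough_start, toy_estimate_embedding, toy_ts_asr
)
print(segments)
```

In this toy run the inaccurate initial segmentation is corrected in the first pass and confirmed in the second, at which point the loop stops. With real components, the two callables would be replaced by an actual embedding extractor and a TS-ASR system, and the stopping rule would likely need a tolerance or a fixed iteration count.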
Pages: 31-38
Number of pages: 8