SIMULTANEOUS SPEECH RECOGNITION AND SPEAKER DIARIZATION FOR MONAURAL DIALOGUE RECORDINGS WITH TARGET-SPEAKER ACOUSTIC MODELS

Citations: 0
Authors
Kanda, Naoyuki [1 ]
Horiguchi, Shota [1 ]
Fujita, Yusuke [1 ]
Xue, Yawen [1 ]
Nagamatsu, Kenji [1 ]
Watanabe, Shinji [2 ]
Affiliations
[1] Hitachi Ltd, Hitachi, Ibaraki, Japan
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
multi-talker speech recognition; speaker diarization; deep learning; overlapped speech
DOI
10.1109/asru46091.2019.9004009
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique that automatically extracts and recognizes only the speech of a target speaker, given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown, because it requires a sample of each target speaker before decoding. To remove this limitation, we propose an iterative method in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are executed alternately. We evaluated the proposed method on very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both the word error rate (WER) and the diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from that of TS-ASR given oracle speaker embeddings. Furthermore, our method solves speaker diarization simultaneously as a by-product, and it achieved a better DER than the conventional clustering-based speaker diarization method based on i-vectors.
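The alternation described in the abstract, between (i) estimating speaker embeddings and (ii) running a target-speaker model conditioned on them, can be illustrated with a toy sketch. This is not the authors' implementation: `estimate_embedding` is a weighted mean standing in for i-vector extraction, `toy_ts_asr` is a nearest-embedding frame selector standing in for TS-ASR (which would also output words), and all function names, the two-speaker setup, and the 2-D features are assumptions made only for illustration.

```python
import numpy as np


def estimate_embedding(frames, weights):
    """Hypothetical stand-in for speaker-embedding extraction:
    the weighted mean of the frames attributed to one speaker."""
    total = weights.sum()
    if total == 0:
        return np.zeros(frames.shape[1])
    return (weights[:, None] * frames).sum(axis=0) / total


def toy_ts_asr(frames, target_emb, other_emb):
    """Hypothetical stand-in for TS-ASR: returns 1.0 for frames that lie
    closer to the target embedding than to the other speaker's embedding."""
    d_target = np.linalg.norm(frames - target_emb, axis=1)
    d_other = np.linalg.norm(frames - other_emb, axis=1)
    return (d_target < d_other).astype(float)


def alternate(frames, init_a, init_b, n_iters=5):
    """Alternate between target-speaker decoding and embedding re-estimation.
    n_iters must be at least 1."""
    emb_a = np.asarray(init_a, dtype=float)
    emb_b = np.asarray(init_b, dtype=float)
    for _ in range(n_iters):
        # Step (ii): decode each speaker given the current embeddings.
        act_a = toy_ts_asr(frames, emb_a, emb_b)
        act_b = toy_ts_asr(frames, emb_b, emb_a)
        # Step (i): re-estimate each embedding from the attributed frames.
        emb_a = estimate_embedding(frames, act_a)
        emb_b = estimate_embedding(frames, act_b)
    return emb_a, emb_b, act_a, act_b


# Four toy "frames" forming two well-separated groups.
frames = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
emb_a, emb_b, act_a, act_b = alternate(frames, [1.0, 1.0], [9.0, 9.0])
```

In this toy version the per-speaker activity vectors `act_a` and `act_b` play the role of the diarization output that the abstract describes as a by-product of decoding.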
Pages: 31-38
Page count: 8