SIMULTANEOUS SPEECH RECOGNITION AND SPEAKER DIARIZATION FOR MONAURAL DIALOGUE RECORDINGS WITH TARGET-SPEAKER ACOUSTIC MODELS

Cited by: 0

Authors:

Kanda, Naoyuki [1]

Horiguchi, Shota [1]

Fujita, Yusuke [1]

Xue, Yawen [1]

Nagamatsu, Kenji [1]

Watanabe, Shinji [2]

Affiliations:

[1] Hitachi Ltd, Hitachi, Ibaraki, Japan

[2] Johns Hopkins Univ, Baltimore, MD 21218 USA

Source:

2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019

Keywords:

multi-talker speech recognition; speaker diarization; deep learning; overlapped speech

DOI:

10.1109/asru46091.2019.9004009

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique that automatically extracts and recognizes only the speech of a target speaker, given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown, because it requires a sample of each target speaker before decoding. To remove this limitation, we propose an iterative method in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are executed alternately. We evaluated the proposed method on very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both the word error rate (WER) and the diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1% from that of TS-ASR given oracle speaker embeddings. Furthermore, our method solves speaker diarization simultaneously as a by-product and achieved a better DER than the conventional clustering-based speaker diarization method based on i-vectors.
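The alternation between steps (i) and (ii) in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the names `refine`, `toy_decode`, `toy_estimate`, `radius`, and `num_iterations` are hypothetical, frames are reduced to single numbers, and a nearest-embedding frame selector stands in for a real TS-ASR decoder. The sketch also assumes that some initial embeddings are available; the abstract does not say how they are obtained.

```python
from statistics import mean


def refine(frames, init_embeddings, decode, estimate, num_iterations=3):
    """Alternate between decoding per speaker and re-estimating embeddings."""
    embeddings = list(init_embeddings)
    for _ in range(num_iterations):
        # Step (ii): run the target-speaker decoder once per current embedding.
        outputs = [decode(frames, e) for e in embeddings]
        # Step (i): re-estimate each embedding from what was attributed to it.
        embeddings = [estimate(frames, out, e)
                      for out, e in zip(outputs, embeddings)]
    # Final decoding pass with the refined embeddings.
    outputs = [decode(frames, e) for e in embeddings]
    return embeddings, outputs


def toy_decode(frames, embedding, radius=1.0):
    # Stand-in for TS-ASR (assumption): return the indices of the frames
    # that lie within `radius` of the target speaker's embedding.
    return [i for i, x in enumerate(frames) if abs(x - embedding) <= radius]


def toy_estimate(frames, indices, previous):
    # Stand-in for embedding extraction (assumption): average the selected
    # frames, keeping the previous embedding if nothing was selected.
    return mean([frames[i] for i in indices]) if indices else previous


# Six one-dimensional "frames": three near 0 and three near 5.
frames = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
embeddings, outputs = refine(frames, [0.8, 4.5], toy_decode, toy_estimate)
print(outputs)
```

In this toy setting, the per-speaker frame indices in `outputs` play the role of the diarization result that the abstract describes as a by-product of decoding.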


Pages: 31-38

Page count: 8
