LOW-LATENCY SPEECH SEPARATION GUIDED DIARIZATION FOR TELEPHONE CONVERSATIONS

Cited by: 2

Authors:

Morrone, Giovanni [1]
Cornell, Samuele [1]
Raj, Desh [2]
Serafini, Luca [1]
Zovato, Enrico [3]
Brutti, Alessio [4]
Squartini, Stefano [1]

Affiliations:

[1] Univ Politecn Marche, Ancona, Italy

[2] Johns Hopkins Univ, Baltimore, MD USA

[3] PerVoice S p A, Trento, Italy

[4] Fondazione Bruno Kessler, Trento, Italy

Source:

2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022

Keywords:

online speaker diarization; speech separation; overlapped speech; deep learning; conversational telephone speech; SPEAKER DIARIZATION;

DOI:

10.1109/SLT54892.2023.10023280

CLC classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper, we analyze the use of speech separation guided diarization (SSGD) in telephone conversations. SSGD performs diarization by separating the speakers' signals and then applying voice activity detection (VAD) to each estimated speaker signal. In particular, we compare two low-latency speech separation models. Moreover, we present a post-processing algorithm that significantly reduces the false alarm errors of an SSGD pipeline. We perform our experiments on two datasets, Fisher Corpus Part 1 and CALLHOME, evaluating both separation and diarization metrics. Notably, our online DPRNN-based SSGD model achieves 11.1% diarization error rate (DER) on CALLHOME, comparable with most state-of-the-art end-to-end neural diarization models despite being trained on an order of magnitude less data and having considerably lower latency (0.1 vs. 10 seconds). We also show that the separated signals can be readily fed to a speech recognition back-end, with performance close to that obtained with the oracle source signals.
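The SSGD principle described in the abstract (separate the speakers first, then run VAD on each estimated stream) can be illustrated with a minimal Python sketch. It assumes the separated signals are already available as plain lists of samples; in the paper they would come from a low-latency separation model such as DPRNN, which is not reproduced here. The per-frame energy VAD, its -30 dB threshold, and the helper names `ssgd`, `to_segments`, and `frame_der` are hypothetical stand-ins, not the authors' components, and the paper's false-alarm post-processing is not modeled. `frame_der` computes a collar-free, frame-level DER and resolves the unknown speaker order by trying every permutation.

```python
import itertools
import math
from typing import List, Sequence, Tuple

# Hypothetical stand-in for the VAD stage: a simple per-frame energy detector.
# The paper's actual separation and VAD models are NOT reproduced here.


def frame_energy_db(signal: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Mean power of each analysis frame, in dB."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # epsilon avoids log(0)
    return energies


def ssgd(separated: Sequence[Sequence[float]], frame_len: int = 160,
         hop: int = 80, threshold_db: float = -30.0) -> List[List[bool]]:
    """Separation guided diarization: one VAD track per estimated speaker signal."""
    return [[e > threshold_db for e in frame_energy_db(s, frame_len, hop)]
            for s in separated]


def to_segments(activity: List[bool], hop: int,
                sample_rate: int) -> List[Tuple[float, float]]:
    """Convert a boolean frame track to (start, end) times in seconds.
    Times are quantized to the frame hop; the last frame's overhang is ignored."""
    segments, start = [], None
    for i, active in enumerate(activity + [False]):  # sentinel closes a trailing run
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * hop / sample_rate, i * hop / sample_rate))
            start = None
    return segments


def frame_der(ref: List[List[bool]], hyp: List[List[bool]]) -> float:
    """Collar-free frame-level DER: (miss + false alarm + confusion) / reference speech.
    Assumes ref and hyp have the same number of speakers and frames; the
    hypothesis speaker order is resolved by trying every permutation."""
    total_ref = sum(sum(track) for track in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")
    n_frames = len(ref[0])
    best = None
    for perm in itertools.permutations(range(len(hyp))):
        errors = 0
        for t in range(n_frames):
            n_ref = sum(track[t] for track in ref)
            n_hyp = sum(track[t] for track in hyp)
            n_correct = sum(ref[k][t] and hyp[perm[k]][t] for k in range(len(ref)))
            # per frame: miss + false alarm + confusion = max(n_ref, n_hyp) - n_correct
            errors += max(n_ref, n_hyp) - n_correct
        best = errors if best is None else min(best, errors)
    return best / total_ref
```

Because a separator does not know which output channel belongs to which speaker, the permutation search in `frame_der` plays the role of the speaker mapping performed by standard scoring tools; published evaluations often also apply a forgiveness collar and score time segments rather than frames.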


Pages: 641-646

Number of pages: 6
