LOW-LATENCY SPEECH SEPARATION GUIDED DIARIZATION FOR TELEPHONE CONVERSATIONS

Cited by: 2
Authors
Morrone, Giovanni [1]
Cornell, Samuele [1]
Raj, Desh [2]
Serafini, Luca [1]
Zovato, Enrico [3]
Brutti, Alessio [4]
Squartini, Stefano [1]
Affiliations
[1] Univ Politecn Marche, Ancona, Italy
[2] Johns Hopkins Univ, Baltimore, MD USA
[3] PerVoice S.p.A., Trento, Italy
[4] Fondazione Bruno Kessler, Trento, Italy
Keywords
online speaker diarization; speech separation; overlapped speech; deep learning; conversational telephone speech; speaker diarization
DOI
10.1109/SLT54892.2023.10023280
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we analyze the use of speech separation guided diarization (SSGD) in telephone conversations. SSGD performs diarization by separating the speakers' signals and then applying voice activity detection to each estimated speaker signal. In particular, we compare two low-latency speech separation models. Moreover, we show a post-processing algorithm that significantly reduces the false alarm errors of an SSGD pipeline. We perform our experiments on two datasets, Fisher Corpus Part 1 and CALLHOME, evaluating both separation and diarization metrics. Notably, our SSGD DPRNN-based online model achieves 11.1% DER on CALLHOME, comparable with most state-of-the-art end-to-end neural diarization models despite being trained on an order of magnitude less data and having considerably lower latency, i.e., 0.1 vs. 10 seconds. We also show that the separated signals can be readily fed to a speech recognition back-end with performance close to that obtained with the oracle source signals.
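The SSGD idea summarized in the abstract (separate first, then run voice activity detection on each separated stream) can be illustrated with a minimal sketch. This is not the authors' implementation: the separation model is assumed to have already produced one signal per speaker, and the simple energy threshold used here is a stand-in for a real VAD. The function names `energy_vad` and `ssgd_diarize`, the frame length, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start time, end time) in seconds


def energy_vad(signal: Sequence[float], sample_rate: int,
               frame_s: float = 0.1, threshold: float = 0.01) -> List[Segment]:
    """Stand-in VAD (assumption): a frame counts as speech when its
    mean energy exceeds a fixed threshold."""
    frame_len = max(1, int(sample_rate * frame_s))
    n_frames = len(signal) // frame_len
    segments: List[Segment] = []
    start = None
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / sample_rate  # frame start time in seconds
        if energy > threshold and start is None:
            start = t                    # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, t))  # speech ends
            start = None
    if start is not None:                # speech runs to the end of the signal
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments


def ssgd_diarize(separated: Sequence[Sequence[float]], sample_rate: int,
                 **vad_kwargs) -> Dict[str, List[Segment]]:
    """Apply the VAD independently to each separated speaker stream and
    return the detected speech segments per speaker."""
    return {f"spk{k}": energy_vad(sig, sample_rate, **vad_kwargs)
            for k, sig in enumerate(separated)}


# Toy example: two already-separated streams, each active in a different half.
spk0 = [0.5] * 8 + [0.0] * 8
spk1 = [0.0] * 8 + [0.5] * 8
result = ssgd_diarize([spk0, spk1], sample_rate=8, frame_s=0.5)
print(result)
```

Because each stream is processed independently, overlapped speech is handled naturally: both speakers can be marked active at the same time. The same independence means that residual leakage from an imperfect separator could trigger spurious detections on the inactive speaker's stream, which is the kind of false alarm error the abstract's post-processing step is said to reduce.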
Pages: 641-646
Number of pages: 6