An experimental review of speaker diarization methods with application to two-speaker conversational telephone speech recordings

Authors
Serafini, Luca [1]
Cornell, Samuele [1]
Morrone, Giovanni [1]
Zovato, Enrico [2]
Brutti, Alessio [3]
Squartini, Stefano [1]
Affiliations
[1] Univ Politecn Marche, Ancona, Italy
[2] PerVoice SpA, Trento, Italy
[3] Fdn Bruno Kessler, Trento, Italy
Keywords
Speaker diarization; Conversational telephone speech; Deep learning; End-to-end neural diarization; Speech separation guided diarization; SEPARATION
DOI
10.1016/j.csl.2023.101534
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural diarization (EEND), and speech separation guided diarization (SSGD) paradigms. We studied the inference-time computational requirements and diarization accuracy on four CTS datasets with different characteristics and languages. We found that, among all methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off between computing requirements and performance. More generally, EEND models were found to be lighter and faster at inference than clustering-based methods. However, they also require a large amount of diarization-oriented annotated data. In particular, EEND-VC performance in our experiments degraded when the dataset size was reduced, whereas self-attentive EEND (SA-EEND) was less affected. We also found that SA-EEND gives less consistent results across the datasets than EEND-VC, with its performance degrading on long conversations with high speech sparsity. Clustering-based diarization systems, and in particular VBx, instead perform more consistently than SA-EEND but are outperformed by EEND-VC. The gap with respect to the latter narrows when overlap-aware clustering methods are considered. SSGD is the most computationally demanding method, but it could be convenient if speech recognition has to be performed. Its performance is close to that of SA-EEND but degrades significantly when the training and inference data characteristics are less matched.
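The abstract compares systems on "diarization accuracy" without naming a metric in this record. As a rough illustration only, the sketch below assumes a frame-level diarization error rate (DER), a metric commonly used for this task, for a two-speaker recording with possible overlap. The function name `frame_level_der` and the per-frame 0/1 activity format are hypothetical choices made for this example, not taken from the paper, and the sketch omits details such as forgiveness collars that scoring tools typically apply.

```python
from itertools import permutations


def frame_level_der(ref, hyp):
    """Simplified frame-level DER (assumed metric, not the paper's scorer).

    ref, hyp: equal-length lists of per-frame tuples, one 0/1 flag per speaker.
    Returns (missed + false-alarm + confused speaker frames) divided by the
    number of reference speaker frames, minimised over speaker label mappings.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    total_ref = sum(sum(frame) for frame in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    n_spk = len(ref[0])
    best_errors = None
    # Speaker labels are arbitrary, so try every mapping of hypothesis speakers.
    for perm in permutations(range(n_spk)):
        errors = 0
        for r, h in zip(ref, hyp):
            h_mapped = [h[p] for p in perm]
            n_ref, n_hyp = sum(r), sum(h_mapped)
            n_correct = sum(a & b for a, b in zip(r, h_mapped))
            # miss + false alarm + confusion collapses to max(n_ref, n_hyp) - n_correct
            errors += max(n_ref, n_hyp) - n_correct
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref


# Toy example: speaker labels are swapped, one overlap frame is half missed,
# and one silent frame is falsely marked as speech.
ref = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
hyp = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0)]
der = frame_level_der(ref, hyp)
```

In this toy case two speaker frames are in error out of five reference speaker frames, so `der` is 0.4. The overlap frame shows why overlap handling matters for the comparison in the abstract: a system that outputs a single speaker per frame cannot avoid a miss there.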
Pages: 22