End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

Cited by: 70
Authors
Horiguchi, Shota [1]
Fujita, Yusuke [1]
Watanabe, Shinji [2]
Xue, Yawen [1]
Nagamatsu, Kenji [1]
Affiliations
[1] Hitachi Ltd, Tokyo, Japan
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
Source
INTERSPEECH 2020
Keywords
speaker diarization; encoder-decoder; attractor calculation
DOI
10.21437/Interspeech.2020-1022
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification code
100104; 100213
Abstract
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. Then, the generated attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69% diarization error rate (DER) on simulated mixtures and an 8.07% DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56% and 9.54%, respectively. In conditions with an unknown number of speakers, our method attained a 15.29% DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43% DER.
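The multiplication step described in the abstract can be illustrated with a minimal sketch. It assumes the attractors are already given (the encoder-decoder that generates them is not shown), and the function name `speaker_activities`, the sigmoid, the 0.5 threshold, and the toy values are illustrative assumptions rather than details taken from this record.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def speaker_activities(embeddings, attractors, threshold=0.5):
    """Return a T x S matrix of 0/1 speaker activities.

    embeddings: T frames, each a D-dimensional list of floats.
    attractors: S attractors, each a D-dimensional list of floats.
    The sigmoid and the threshold are assumptions made for this sketch.
    """
    activities = []
    for frame in embeddings:
        row = []
        for attractor in attractors:
            # Inner product between one frame embedding and one attractor.
            score = sum(e * a for e, a in zip(frame, attractor))
            row.append(1 if sigmoid(score) > threshold else 0)
        activities.append(row)
    return activities


# Hypothetical toy input: 3 frames, 2-dimensional embeddings, 2 attractors.
embeddings = [[2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]]
attractors = [[1.0, 0.0], [0.0, 1.0]]
result = speaker_activities(embeddings, attractors)
print(result)
```

Each attractor yields one column of activities, so the number of output speakers follows the number of attractors; a frame with more than one active column corresponds to overlapping speech.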
Pages: 269-273
Number of pages: 5
Related papers
50 in total
  • [1] Encoder-Decoder Based Attractors for End-to-End Neural Diarization
    Horiguchi, Shota
    Fujita, Yusuke
    Watanabe, Shinji
    Xue, Yawen
    Garcia, Paola
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1493 - 1507
  • [2] Attention-Based Encoder-Decoder End-to-End Neural Diarization With Embedding Enhancer
    Chen, Zhengyang
    Han, Bing
    Wang, Shuai
    Qian, Yanmin
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1636 - 1649
  • [3] Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors
    Chetupalli, Srikanth Raj
    Habets, Emanuel A. P.
    [J]. INTERSPEECH 2022, 2022, : 5393 - 5397
  • [4] End-to-End Deep Background Subtraction based on Encoder-Decoder Network
    Le, Duy H.
    Pham, Tuan, V
    [J]. PROCEEDINGS OF 2019 6TH NATIONAL FOUNDATION FOR SCIENCE AND TECHNOLOGY DEVELOPMENT (NAFOSTED) CONFERENCE ON INFORMATION AND COMPUTER SCIENCE (NICS), 2019, : 381 - 386
  • [5] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Thebaud, Thomas
    Dehak, Najim
    Kowalczyk, Konrad
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
  • [6] BW-EDA-EEND: STREAMING END-TO-END NEURAL SPEAKER DIARIZATION FOR A VARIABLE NUMBER OF SPEAKERS
    Han, Eunjung
    Lee, Chul
    Stolcke, Andreas
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7193 - 7197
  • [7] END-TO-END DIARIZATION FOR VARIABLE NUMBER OF SPEAKERS WITH LOCAL-GLOBAL NETWORKS AND DISCRIMINATIVE SPEAKER EMBEDDINGS
    Maiti, Soumi
    Erdogan, Hakan
    Wilson, Kevin
    Wisdom, Scott
    Watanabe, Shinji
    Hershey, John R.
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7183 - 7187
  • [8] End-to-End Trained CNN Encoder-Decoder Networks for Image Steganography
    Rehman, Atique ur
    Rahim, Rafia
    Nadeem, Shahroz
    ul Hussain, Sibt
    [J]. COMPUTER VISION - ECCV 2018 WORKSHOPS, PT IV, 2019, 11132 : 723 - 729
  • [9] TRANSCRIBE-TO-DIARIZE: NEURAL SPEAKER DIARIZATION FOR UNLIMITED NUMBER OF SPEAKERS USING END-TO-END SPEAKER-ATTRIBUTED ASR
    Kanda, Naoyuki
    Xiao, Xiong
    Gaur, Yashesh
    Wang, Xiaofei
    Meng, Zhong
    Chen, Zhuo
    Yoshioka, Takuya
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8082 - 8086
  • [10] EEND-SS: JOINT END-TO-END NEURAL SPEAKER DIARIZATION AND SPEECH SEPARATION FOR FLEXIBLE NUMBER OF SPEAKERS
    Maiti, Soumi
    Ueda, Yushi
    Watanabe, Shinji
    Zhang, Chunlei
    Yu, Meng
    Zhang, Shi-Xiong
    Xu, Yong
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 480 - 487