Encoder-Decoder Based Attractors for End-to-End Neural Diarization

Cited by: 15
Authors
Horiguchi, Shota [1]
Fujita, Yusuke [1, 2]
Watanabe, Shinji [3]
Xue, Yawen [1]
Garcia, Paola [4]
Affiliations
[1] Hitachi Ltd, Kokubunji, Tokyo 1858601, Japan
[2] LINE Corp, Shinjuku Ku, Tokyo 1600004, Japan
[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[4] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
Training; Transformers; Time-frequency analysis; Voice activity detection; Task analysis; Neural networks; Licenses; Speaker diarization; EEND; EDA; Speech separation; Bayesian HMM
DOI
10.1109/TASLP.2022.3162080
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional cascaded approach to speaker diarization, EEND methods are better in terms of speaker overlap handling. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce an encoder-decoder-based attractor calculation module (EDA) to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against cascaded approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional cascaded approach.
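The dot-product step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the LSTM encoder-decoder is omitted, the attractors and embeddings are hand-picked toy vectors, and the function names (`count_attractors`, `diarize`) are hypothetical. The abstract only says that generation stops when "a stopping condition is satisfied"; thresholding a per-attractor existence probability at 0.5, and passing the dot products through a sigmoid before thresholding, are assumptions made here for illustration.

```python
import math


def sigmoid(x):
    """Logistic function; adequate for the small toy values used here."""
    return 1.0 / (1.0 + math.exp(-x))


def count_attractors(existence_probs, threshold=0.5):
    """Count attractors generated before the stopping condition fires.

    Assumption: the stopping condition is an existence probability
    falling below `threshold`.
    """
    count = 0
    for p in existence_probs:
        if p < threshold:
            break
        count += 1
    return count


def diarize(embeddings, attractors, threshold=0.5):
    """Return a frames x speakers list of 0/1 speech activities.

    Each activity is the sigmoid of the dot product between one
    frame-wise embedding and one speaker-wise attractor (assumed
    post-processing), thresholded at `threshold`.
    """
    result = []
    for e in embeddings:
        row = []
        for a in attractors:
            score = sum(x * y for x, y in zip(e, a))
            row.append(1 if sigmoid(score) > threshold else 0)
        result.append(row)
    return result


# Toy data: two attractors survive the (assumed) stopping condition.
existence_probs = [0.98, 0.91, 0.12, 0.05]
attractors = [[4.0, -2.0], [-2.0, 4.0]][:count_attractors(existence_probs)]

embeddings = [
    [1.0, 0.0],    # speaker 0 only
    [0.0, 1.0],    # speaker 1 only
    [1.0, 1.0],    # overlap: positive dot product with both attractors
    [-1.0, -1.0],  # silence: negative dot product with both
]

print(diarize(embeddings, attractors))
# [[1, 0], [0, 1], [1, 1], [0, 0]]
```

The third frame shows why this formulation can handle overlaps: each speaker is decided independently, so one embedding can be active for several attractors at once.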
Pages: 1493-1507
Number of pages: 15
Related papers
50 in total
  • [21] Robust End-to-end Speaker Diarization with Generic Neural Clustering
    Yang, Chenyu
    Wang, Yu
    [J]. INTERSPEECH 2022, 2022, : 1471 - 1475
  • [22] End-To-End Neural Speaker Diarization Through Step-Function
    Latypov, Rustam
    Stolov, Evgeni
    [J]. 2021 IEEE 15TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2021), 2021,
  • [23] TOWARDS END-TO-END SPEAKER DIARIZATION WITH GENERALIZED NEURAL SPEAKER CLUSTERING
    Zhang, Chunlei
    Shi, Jiatong
    Weng, Chao
    Yu, Meng
    Yu, Dong
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8372 - 8376
  • [24] ONLINE END-TO-END NEURAL DIARIZATION WITH SPEAKER-TRACING BUFFER
    Xue, Yawen
    Horiguchi, Shota
    Fujita, Yusuke
    Watanabe, Shinji
    Garcia, Paola
    Nagamatsu, Kenji
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 841 - 848
  • [25] End-to-End Neural Speaker Diarization with Permutation-Free Objectives
    Fujita, Yusuke
    Kanda, Naoyuki
    Horiguchi, Shota
    Nagamatsu, Kenji
    Watanabe, Shinji
    [J]. INTERSPEECH 2019, 2019, : 4300 - 4304
  • [26] End-to-end neural speaker diarization with an iterative adaptive attractor estimation
    Hao, Fengyuan
    Li, Xiaodong
    Zheng, Chengshi
    [J]. NEURAL NETWORKS, 2023, 166 : 566 - 578
  • [27] MULTI-CHANNEL END-TO-END NEURAL DIARIZATION WITH DISTRIBUTED MICROPHONES
    Horiguchi, Shota
    Takashima, Yuki
    Garcia, Paola
    Watanabe, Shinji
    Kawaguchi, Yohei
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7332 - 7336
  • [28] INTEGRATING END-TO-END NEURAL AND CLUSTERING-BASED DIARIZATION: GETTING THE BEST OF BOTH WORLDS
    Kinoshita, Keisuke
    Delcroix, Marc
    Tawara, Naohiro
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7198 - 7202
  • [29] Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization
    Jiao, Xiaolin
    Chen, Yaqi
    Qu, Dan
    Yang, Xukui
    [J]. ELECTRONICS, 2023, 12 (19)
  • [30] Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech
    Kinoshita, Keisuke
    Delcroix, Marc
    Tawara, Naohiro
    [J]. INTERSPEECH 2021, 2021, : 3565 - 3569