Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization

Authors
Fujita, Yusuke [1 ,2 ]
Ogawa, Tetsuji [2 ]
Kobayashi, Tetsunori [2 ]
Affiliations
[1] LY Corp, Tokyo 1028282, Japan
[2] Waseda Univ, Dept Comp Sci & Commun Engn, Tokyo 1620042, Japan
Keywords
Encoder-decoder-based attractors; end-to-end neural diarization; intermediate objectives; non-autoregressive models; self-conditioning; speaker diarization;
DOI
10.1109/ACCESS.2023.3340307
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method belongs to the end-to-end neural diarization (EEND) family, a promising approach that solves the speaker diarization problem with a multi-label classification neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency into the models by exploiting the self-conditioning mechanism, which was originally applied to an automatic speech recognition model. With self-conditioning, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To train the attractor-based EEND model efficiently, we propose an improved attractor computation module, named the non-autoregressive attractor, which produces all speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments on the CALLHOME two-speaker dataset show that the proposed self-conditioning improves diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.
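The self-conditioning idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, as in self-conditioned models for speech recognition, that each intermediate layer's speaker posteriors are linearly projected and added back to the hidden states before the next layer. The names `toy_layer`, `self_conditioned_forward`, `w_out`, and `w_back` are hypothetical, and the toy layer uses a mean over frames as a crude stand-in for Transformer self-attention.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_layer(hidden, weight):
    """Hypothetical stand-in for one encoder block (not a real Transformer layer).

    The mean over frames gives every frame a summary of the whole sequence,
    which is the role self-attention would play in an actual EEND encoder.
    """
    n_frames = len(hidden)
    context = [sum(col) / n_frames for col in zip(*hidden)]
    projected = matmul(hidden, weight)
    return [
        [h + math.tanh(p + c) for h, p, c in zip(h_row, p_row, context)]
        for h_row, p_row in zip(hidden, projected)
    ]


def self_conditioned_forward(frames, layer_weights, w_out, w_back):
    """Return per-layer speaker posteriors, each of shape (frames, speakers).

    Assumption: one classifier (w_out) is shared by all layers, and the
    feedback projection (w_back) maps posteriors back to the hidden size.
    """
    hidden = frames
    predictions = []
    for index, weight in enumerate(layer_weights):
        hidden = toy_layer(hidden, weight)
        # Multi-label output: an independent sigmoid per speaker and frame.
        posteriors = [[sigmoid(v) for v in row] for row in matmul(hidden, w_out)]
        predictions.append(posteriors)
        if index < len(layer_weights) - 1:
            # Self-conditioning: feed the intermediate labels to the next layer.
            feedback = matmul(posteriors, w_back)
            hidden = [
                [h + f for h, f in zip(h_row, f_row)]
                for h_row, f_row in zip(hidden, feedback)
            ]
    return predictions


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_frames, dim, n_speakers, n_layers = 20, 8, 2, 4
frames = random_matrix(n_frames, dim, rng, scale=1.0)
layer_weights = [random_matrix(dim, dim, rng) for _ in range(n_layers)]
w_out = random_matrix(dim, n_speakers, rng)
w_back = random_matrix(n_speakers, dim, rng)

predictions = self_conditioned_forward(frames, layer_weights, w_out, w_back)
print(len(predictions), len(predictions[0]), len(predictions[0][0]))  # 4 20 2
```

With random, untrained weights the posteriors carry no meaning; the sketch only shows the data flow. In training, a diarization loss would presumably be applied to every entry of `predictions`, which is what the keyword "intermediate objectives" suggests.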
Pages: 140069-140076
Number of pages: 8