Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization

Cited by: 0

Authors:

Fujita, Yusuke [1,2]

Ogawa, Tetsuji [2]

Kobayashi, Tetsunori [2]

Affiliations:

[1] LY Corp, Tokyo 1028282, Japan

[2] Waseda Univ, Dept Comp Sci & Commun Engn, Tokyo 1620042, Japan

Source:

IEEE ACCESS | 2023 / Vol. 11

Keywords:

Encoder-decoder-based attractors; end-to-end neural diarization; intermediate objectives; non-autoregressive models; self-conditioning; speaker diarization;

DOI:

10.1109/ACCESS.2023.3340307

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method belongs to the end-to-end neural diarization (EEND) family, a promising approach that solves the speaker diarization problem with a multi-label classification neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency into the models by exploiting the self-conditioning mechanism, which was originally applied to an automatic speech recognition model. With the self-conditioning mechanism, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To efficiently train the attractor-based EEND model, we propose an improved attractor computation module named non-autoregressive attractor, which produces speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments with the CALLHOME two-speaker dataset show that the proposed self-conditioning boosts diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.
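
The self-conditioning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `self_conditioned_forward`, the shared output projection `w_out`, the feedback projection `w_cond`, and the toy `tanh` layers (standing in for Transformer encoder blocks) are all assumptions made for illustration. Each layer emits intermediate frame-level speaker posteriors, and those posteriors are projected back and added to the hidden features that the next layer consumes.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def self_conditioned_forward(frames, layers, w_out, w_cond):
    """Hypothetical self-conditioned encoder pass.

    frames: (T, D) frame features
    layers: list of callables mapping (T, D) -> (T, D)
    w_out:  (D, S) projection to S speaker activity logits
    w_cond: (S, D) projection of intermediate labels back to feature space
    Returns one (T, S) posterior matrix per layer; the last is the final output.
    """
    h = frames
    predictions = []
    for i, layer in enumerate(layers):
        h = layer(h)
        y = sigmoid(h @ w_out)  # multi-label posteriors, one column per speaker
        predictions.append(y)
        if i < len(layers) - 1:
            # Condition the next layer on the intermediate speaker labels.
            h = h + y @ w_cond
    return predictions


# Toy usage with random weights (illustrative values only).
rng = np.random.default_rng(0)
T, D, S = 6, 8, 2
weights = [rng.standard_normal((D, D)) * 0.5 for _ in range(3)]
toy_layers = [lambda h, w=w: np.tanh(h @ w) for w in weights]
outputs = self_conditioned_forward(
    rng.standard_normal((T, D)),
    toy_layers,
    rng.standard_normal((D, S)),
    rng.standard_normal((S, D)),
)
print(len(outputs), outputs[-1].shape)
```

In a real Transformer layer, self-attention would let every frame attend to the conditioned features of all other frames, which is how the whole sequence of intermediate labels could serve as a reference. The toy layers here act frame by frame and only show the data flow.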


Pages: 140069-140076

Page count: 8
