ONLINE END-TO-END NEURAL DIARIZATION WITH SPEAKER-TRACING BUFFER

Cited by: 22

Authors:

Xue, Yawen [1]

Horiguchi, Shota [1]

Fujita, Yusuke [1]

Watanabe, Shinji [2]

Garcia, Paola [2]

Nagamatsu, Kenji [1]

Affiliations:

[1] Hitachi Ltd, Res & Dev Grp, Tokyo, Japan

[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

Online speaker diarization; speaker-tracing buffer; end-to-end; self-attention

DOI:

10.1109/SLT48900.2021.9383523

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker permutation problem, because speaker regions may be assigned inconsistently across the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames in the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we trained SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. In experiments, online SA-EEND with variable chunk-size training achieved diarization error rates (DERs) of 12.54% for CALLHOME and 20.77% for CSJ with 1.4 s actual latency.
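
The correlation check mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the network outputs per-speaker activity probabilities, that the outputs previously produced for the buffered frames are kept, and that the speaker order is chosen by maximizing the summed Pearson correlation over all permutations. The function names `pearson`, `best_permutation`, and `align` are hypothetical, and frame selection and the self-attention network itself are omitted.

```python
from itertools import permutations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_permutation(prev_buf_out, new_buf_out):
    """Pick the speaker order of the new outputs that best matches the stored ones.

    Both arguments are lists of frames for the same buffered frames; each
    frame is a list of per-speaker activity probabilities (an assumption).
    """
    n_spk = len(prev_buf_out[0])
    prev_cols = [[f[s] for f in prev_buf_out] for s in range(n_spk)]
    new_cols = [[f[s] for f in new_buf_out] for s in range(n_spk)]
    return max(
        permutations(range(n_spk)),
        key=lambda p: sum(pearson(prev_cols[s], new_cols[p[s]]) for s in range(n_spk)),
    )


def align(frames, perm):
    """Reorder the speaker columns of each frame according to perm."""
    return [[f[i] for i in perm] for f in frames]


# Toy data: the network swapped the two speakers on the buffered frames.
prev_buf = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]   # outputs stored earlier
new_buf = [[0.2, 0.8], [0.1, 0.9], [0.9, 0.2]]    # same frames, current pass
new_chunk = [[0.1, 0.95], [0.85, 0.1]]            # outputs for the current chunk

perm = best_permutation(prev_buf, new_buf)
aligned_chunk = align(new_chunk, perm)
```

Enumerating all permutations is practical only for a small number of speakers, such as the two-speaker toy case here.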

Pages: 841-848

Page count: 8
