ONLINE END-TO-END NEURAL DIARIZATION WITH SPEAKER-TRACING BUFFER

Cited by: 22
Authors
Xue, Yawen [1]
Horiguchi, Shota [1]
Fujita, Yusuke [1]
Watanabe, Shinji [2]
Garcia, Paola [2]
Nagamatsu, Kenji [1]
Affiliations
[1] Hitachi Ltd, Res & Dev Grp, Tokyo, Japan
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
Online speaker diarization; speaker-tracing buffer; end-to-end; self-attention
DOI
10.1109/SLT48900.2021.9383523
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently suffers from a speaker permutation problem, because speaker regions may be assigned inconsistently across the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames of the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we train SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. In experiments, online SA-EEND with variable chunk-size training achieved diarization error rates (DERs) of 12.54% on CALLHOME and 20.77% on CSJ with 1.4 s actual latency.
Pages: 841-848
Number of pages: 8
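The buffer mechanism summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `model` callable (assumed to map a `(frames, features)` array to `(frames, speakers)` activity scores), the exhaustive permutation search, and the policy of simply keeping the most recent frames in the buffer are all assumptions made for illustration. The abstract only states that frames carrying permutation information are selected, not how.

```python
from itertools import permutations

import numpy as np


def correlation(a, b, eps=1e-8):
    """Pearson-style correlation of two 1-D arrays; eps guards constant inputs."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def best_permutation(ref, hyp):
    """Return perm such that hyp[:, perm[s]] correlates best with ref[:, s]."""
    n_spk = ref.shape[1]

    def score(perm):
        return sum(correlation(ref[:, s], hyp[:, p]) for s, p in enumerate(perm))

    return max(permutations(range(n_spk)), key=score)


def online_diarize(chunks, model, buffer_size):
    """Process chunks in order, keeping the speaker order consistent via a buffer.

    `model` is a hypothetical callable standing in for SA-EEND.
    The buffer keeps the most recent `buffer_size` frames (an assumed policy).
    """
    buf_x = None  # buffered input frames
    buf_y = None  # outputs previously emitted for those frames
    outputs = []
    for chunk in chunks:
        if buf_x is None:
            y = model(chunk)  # first chunk defines the speaker order
        else:
            # Stack buffered frames with the current chunk and run the model once.
            stacked = np.concatenate([buf_x, chunk], axis=0)
            y_all = model(stacked)
            n_buf = len(buf_x)
            # Compare new outputs on the buffer frames with the stored ones
            # and reorder the speaker columns to match.
            perm = list(best_permutation(buf_y, y_all[:n_buf]))
            y = y_all[n_buf:, perm]
        outputs.append(y)
        # Update the buffer with the newest frames and their aligned outputs.
        buf_x = chunk if buf_x is None else np.concatenate([buf_x, chunk], axis=0)
        buf_y = y if buf_y is None else np.concatenate([buf_y, y], axis=0)
        buf_x, buf_y = buf_x[-buffer_size:], buf_y[-buffer_size:]
    return np.concatenate(outputs, axis=0)
```

The exhaustive search over permutations grows factorially with the number of speakers, so it is only reasonable for the small speaker counts typical of two- or three-party recordings.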