DIVE: END-TO-END SPEECH DIARIZATION VIA ITERATIVE SPEAKER EMBEDDING

Cited by: 7
Authors
Zeghidour, Neil [1 ]
Teboul, Olivier [1 ]
Grangier, David [1 ]
Affiliations
[1] Google Res, Brain Team, Mountain View, CA 94043 USA
Keywords
diarization; speech; end-to-end learning; source separation
DOI
10.1109/ASRU51503.2021.9688178
CLC classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce DIVE, an end-to-end speaker diarization system. DIVE presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting their voice activity conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and jointly optimizes all parameters of the system with a multi-speaker voice activity loss. DIVE does not require the training speaker identities and allows efficient window-based training. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative.
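To make the iterative speaker-embedding loop and the collar-masked training loss more concrete, here is a minimal, hypothetical PyTorch sketch. It is not the authors' implementation: the GRU encoder, the attention-pooling selector, the conditioning scheme, the `boundary_mask` handling, and all dimensions are illustrative assumptions about how an iterative per-speaker extraction step and a loss that excludes turn-boundary frames could be wired together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IterativeDiarizer(nn.Module):
    """Sketch (assumed architecture): repeatedly pool a speaker embedding from
    frames not yet claimed by earlier speakers, then predict that speaker's
    per-frame voice activity conditioned on the embedding."""

    def __init__(self, feat_dim=256, emb_dim=128, max_speakers=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)  # frame encoder (illustrative)
        self.selector = nn.Linear(emb_dim, 1)      # scores frames for extracting the next speaker
        self.vad_head = nn.Linear(2 * emb_dim, 1)  # voice activity given a frame + one speaker embedding
        self.max_speakers = max_speakers

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features
        frames, _ = self.encoder(feats)                                # (batch, time, emb_dim)
        claimed = torch.zeros(frames.shape[:2], device=feats.device)  # soft mask of frames already assigned
        all_logits = []
        for _ in range(self.max_speakers):
            # Discourage selecting frames already claimed by previously extracted speakers;
            # extracting speakers one at a time removes the ordering ambiguity, so no
            # permutation invariant loss is needed.
            scores = self.selector(frames).squeeze(-1) - 10.0 * claimed
            weights = torch.softmax(scores, dim=-1)                    # (batch, time)
            emb = torch.einsum("bt,btd->bd", weights, frames)          # pooled speaker embedding
            # Predict this speaker's activity at every frame, conditioned on the embedding.
            cond = torch.cat([frames, emb.unsqueeze(1).expand_as(frames)], dim=-1)
            logits = self.vad_head(cond).squeeze(-1)                   # (batch, time)
            all_logits.append(logits)
            claimed = torch.clamp(claimed + torch.sigmoid(logits), max=1.0)
        return torch.stack(all_logits, dim=1)                          # (batch, max_speakers, time)


def collar_masked_vad_loss(logits, targets, boundary_mask):
    """Per-frame binary cross-entropy that ignores frames inside a collar around
    speaker-turn boundaries (boundary_mask == 1 near a turn), mirroring the
    abstract's exclusion of unreliable boundaries from training."""
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    keep = 1.0 - boundary_mask  # keep only frames away from turn boundaries
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)
```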
Pages: 702-709
Page count: 8