End-to-end Neural Diarization: From Transformer to Conformer

Cited by: 15
Authors
Liu, Yi Chieh [1, 3]
Han, Eunjung [2 ]
Lee, Chul [2 ]
Stolcke, Andreas [2 ]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Amazon Alexa Speech, Sunnyvale, CA USA
[3] Amazon, Sunnyvale, CA USA
Source
INTERSPEECH 2021
Keywords
diarization; transformer; conformer
DOI
10.21437/Interspeech.2021-1909
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
We propose a new end-to-end neural diarization (EEND) system that is based on Conformer, a recently proposed neural architecture that combines convolutional mappings and Transformer to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND in the Transformer-based EEND, and then Conformer gives an additional gain over the Transformer-based EEND. However, we notice that the Conformer-based EEND does not generalize as well from simulated to real conversation data as the Transformer-based model. This leads us to quantify the mismatch between simulated data and real speaker behavior in terms of temporal statistics reflecting turn-taking between speakers, and investigate its correlation with diarization error. By mixing simulated and real data in EEND training, we mitigate the mismatch further, with Conformer-based EEND achieving 24% error reduction over the baseline SA-EEND system, and 10% improvement over the best augmented Transformer-based system, on two-speaker CALLHOME data.
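The abstract refers to temporal statistics that reflect turn-taking between speakers but does not list them. As a rough illustration only, the sketch below computes a few plausible statistics (silence, single-speaker and overlap ratios, plus a speaker-change rate) from a frame-level activity matrix. The function name `turn_taking_stats`, the choice of statistics, and the toy labels are assumptions made for this example, not the authors' definitions.

```python
import numpy as np


def turn_taking_stats(labels):
    """Summarize turn-taking in a (frames, speakers) binary activity matrix.

    Hypothetical helper: the statistics chosen here are illustrative and
    are not taken from the paper.
    """
    labels = np.asarray(labels, dtype=int)
    active = labels.sum(axis=1)  # number of active speakers in each frame
    n_frames = labels.shape[0]
    # A "change" is any frame whose set of active speakers differs
    # from that of the previous frame.
    changes = int(np.any(labels[1:] != labels[:-1], axis=1).sum())
    return {
        "silence_ratio": float((active == 0).mean()),
        "single_ratio": float((active == 1).mean()),
        "overlap_ratio": float((active >= 2).mean()),
        "changes_per_frame": changes / max(n_frames - 1, 1),
    }


# Toy two-speaker example with six frames (made-up labels).
toy_labels = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [0, 1]]
stats = turn_taking_stats(toy_labels)
print(stats)
```

Computing the same summary on a simulated training set and on a real conversation set, then comparing the two dictionaries, would give one simple view of the mismatch the abstract describes.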
Pages: 3081-3085
Page count: 5
Related papers
  • [1] Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization
    Jiao, Xiaolin
    Chen, Yaqi
    Qu, Dan
    Yang, Xukui
    [J]. ELECTRONICS, 2023, 12 (19)
  • [2] Robust End-to-end Speaker Diarization with Conformer and Additive Margin Penalty
    Leung, Tsun-Yat
    Samarakoon, Lahiru
    [J]. INTERSPEECH 2021, 2021, : 3575 - 3579
  • [3] ASR-AWARE END-TO-END NEURAL DIARIZATION
    Khare, Aparna
    Han, Eunjung
    Yang, Yuguang
    Stolcke, Andreas
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8092 - 8096
  • [4] AUXILIARY LOSS OF TRANSFORMER WITH RESIDUAL CONNECTION FOR END-TO-END SPEAKER DIARIZATION
    Yu, Yechan
    Park, Dongkeon
    Kim, Hong Kook
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8377 - 8381
  • [5] END-TO-END NEURAL SPEAKER DIARIZATION WITH SELF-ATTENTION
    Fujita, Yusuke
    Kanda, Naoyuki
    Horiguchi, Shota
    Xue, Yawen
    Nagamatsu, Kenji
    Watanabe, Shinji
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 296 - 303
  • [6] End-to-End Audio-Visual Neural Speaker Diarization
    He, Mao-kui
    Du, Jun
    Lee, Chin-Hui
    [J]. INTERSPEECH 2022, 2022, : 1461 - 1465
  • [7] Robust End-to-end Speaker Diarization with Generic Neural Clustering
    Yang, Chenyu
    Wang, Yu
    [J]. INTERSPEECH 2022, 2022, : 1471 - 1475
  • [8] From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
    Landini, Federico
    Lozano-Diez, Alicia
    Diez, Mireia
    Burget, Lukas
    [J]. INTERSPEECH 2022, 2022, : 5095 - 5099
  • [9] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Thebaud, Thomas
    Dehak, Najim
    Kowalczyk, Konrad
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
  • [10] Encoder-Decoder Based Attractors for End-to-End Neural Diarization
    Horiguchi, Shota
    Fujita, Yusuke
    Watanabe, Shinji
    Xue, Yawen
    Garcia, Paola
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1493 - 1507