Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization

Cited by: 1
Authors
Jeoung, Ye-Rin [1 ]
Choi, Jeong-Hwan [1 ]
Seong, Ju-Seok [1 ]
Kyung, JeHyun [1 ]
Chang, Joon-Hyuk [1 ]
Affiliations
[1] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Keywords
speaker diarization; end-to-end neural diarization; self-attention mechanism; fine-tuning; self-distillation;
DOI
10.21437/Interspeech.2023-1404
CLC Classification
O42 [Acoustics];
Discipline Codes
070206; 082403
Abstract
In this study, we explore self-distillation (SD) techniques to improve the performance of the transformer-encoder-based self-attentive (SA) end-to-end neural speaker diarization (EEND) model. We first apply SD approaches introduced in the automatic speech recognition field to the SA-EEND model to confirm their potential for speaker diarization. Then, we propose two novel SD methods for SA-EEND, which distill either the prediction output of the model or the SA heads of the upper blocks into the SA heads of the lower blocks. Consequently, we expect the high-level speaker-discriminative knowledge learned by the upper blocks to be shared across the lower blocks, thereby enabling the SA heads of the lower blocks to effectively capture the discriminative patterns of overlapped speech from multiple speakers. Experimental results on the simulated and CALLHOME datasets show that SD generally improves the baseline performance and that the proposed methods outperform the conventional SD approaches.
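For intuition, the head-to-head variant described in the abstract can be pictured as an auxiliary loss that pulls a lower block's attention distributions toward those of an upper block. Below is a minimal PyTorch sketch of that idea; the function name head_distillation_loss, the KL form of the loss, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def head_distillation_loss(attn_lower: torch.Tensor,
                           attn_upper: torch.Tensor) -> torch.Tensor:
    """Illustrative head-to-head self-distillation loss (an assumed
    formulation, not necessarily the paper's exact loss).

    Both tensors hold row-normalized self-attention weights of shape
    (batch, heads, frames, frames).
    """
    # Detach the teacher (upper block) so gradients flow only into
    # the student (lower block) heads.
    teacher = attn_upper.detach()
    log_student = torch.log(attn_lower.clamp_min(1e-8))
    # KL(teacher || student) between attention rows, averaged over batch.
    return F.kl_div(log_student, teacher, reduction="batchmean")

# Hypothetical training step: add the SD term to the diarization loss
# over chosen (lower, upper) block pairs, weighted by a scalar lam.
# loss = diar_loss + lam * sum(
#     head_distillation_loss(attn[l], attn[u]) for l, u in pairs)
```

The abstract's other proposed variant would instead use the model's prediction output as the teacher signal for the lower SA heads; the two methods differ in the choice of teacher, not in this overall training pattern.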
Pages: 3197-3201 (5 pages)
Related Papers
50 records in total
  • [31] ONLINE END-TO-END NEURAL DIARIZATION WITH SPEAKER-TRACING BUFFER
    Xue, Yawen
    Horiguchi, Shota
    Fujita, Yusuke
    Watanabe, Shinji
    Garcia, Paola
    Nagamatsu, Kenji
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 841 - 848
  • [32] ESTMST-ST: An End-to-End Soft Threshold and Multiloss Self-Distillation Based Swin Transformer for Underwater Acoustic Signal Recognition
    Wu, Fan
    Yao, Haiyang
    Zhao, Zhongda
    Zhao, Xiaobo
    Zang, Yuzhang
    Wang, Haiyan
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [33] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [34] A Novel End-to-End Corporate Credit Rating Model Based on Self-Attention Mechanism
    Chen, Binbin
    Long, Shengjie
    IEEE ACCESS, 2020, 8 (08): 203876 - 203889
  • [35] Speaker diarization with variants of self-attention and joint speaker embedding extractor
    Fu, Pengbin
    Ma, Yuchen
    Yang, Huirong
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 45 (05) : 9169 - 9180
  • [36] Self-Distillation Based on High-level Information Supervision for Compressing End-to-End ASR Model
    Xu, Qiang
    Song, Tongtong
    Wang, Longbiao
    Shi, Hao
    Lin, Yuqin
    Lv, Yongjie
    Ge, Meng
    Yu, Qiang
    Dang, Jianwu
    INTERSPEECH 2022, 2022, : 1716 - 1720
  • [37] A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations
    Ma, Hui
    Wang, Jian
    Lin, Hongfei
    Zhang, Bo
    Zhang, Yijia
    Xu, Bo
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 776 - 788
  • [38] TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION WITH LOCAL DENSE SYNTHESIZER ATTENTION
    Xu, Menglong
    Li, Shengqiang
    Zhang, Xiao-Lei
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5899 - 5903
  • [39] Transformer-based end-to-end attack on text CAPTCHAs with triplet deep attention
    Zhang, Bo
    Xiong, Yu-Jie
    Xia, Chunming
    Gao, Yongbin
    COMPUTERS & SECURITY, 2024, 146
  • [40] TRANSFORMER-BASED ONLINE CTC/ATTENTION END-TO-END SPEECH RECOGNITION ARCHITECTURE
    Miao, Haoran
    Cheng, Gaofeng
    Gao, Changfeng
    Zhang, Pengyuan
    Yan, Yonghong
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6084 - 6088