ASR-AWARE END-TO-END NEURAL DIARIZATION

Cited by: 3
Authors
Khare, Aparna [1]
Han, Eunjung [1]
Yang, Yuguang [1]
Stolcke, Andreas [1]
Affiliations
[1] Amazon Alexa AI, Sunnyvale, CA 94089 USA
Keywords
diarization; automatic speech recognition; multi-task learning; speaker diarization
DOI
10.1109/ICASSP43922.2022.9746964
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline.
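The multi-task objective described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration and not the authors' implementation: the diarization term is taken to be a permutation-invariant binary cross-entropy (a common choice in EEND systems, not stated in the abstract), the auxiliary term is a frame-level cross-entropy over hypothetical position-in-word classes, and the names `pit_diarization_loss`, `multitask_loss`, and `aux_weight` are invented here.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pit_diarization_loss(probs, labels):
    """Permutation-invariant BCE (assumed form of the diarization loss).

    probs, labels: one list per frame, each holding one value per speaker.
    The loss is the minimum mean BCE over all speaker-label permutations.
    """
    n_spk = len(labels[0])
    best = float("inf")
    for perm in permutations(range(n_spk)):
        total = sum(
            bce(p[s], y[perm[s]])
            for p, y in zip(probs, labels)
            for s in range(n_spk)
        )
        best = min(best, total / (len(labels) * n_spk))
    return best


def cross_entropy(class_probs, targets, eps=1e-7):
    """Mean cross-entropy of per-frame class distributions against class ids."""
    return -sum(
        math.log(max(p[t], eps)) for p, t in zip(class_probs, targets)
    ) / len(targets)


def multitask_loss(spk_probs, spk_labels, aux_probs, aux_targets, aux_weight=0.5):
    """Diarization loss plus a weighted auxiliary classification loss.

    aux_weight is a hypothetical hyperparameter; the abstract gives no value.
    """
    diar = pit_diarization_loss(spk_probs, spk_labels)
    aux = cross_entropy(aux_probs, aux_targets)
    return diar + aux_weight * aux


# Toy example: 3 frames, 2 speakers, 3 hypothetical position-in-word classes.
spk_probs = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9]]
spk_labels = [[1, 0], [1, 1], [0, 1]]
aux_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
aux_targets = [0, 1, 2]
loss = multitask_loss(spk_probs, spk_labels, aux_probs, aux_targets)
```

In this sketch the auxiliary term only shapes training; at inference time a model trained this way would need acoustic input alone, which is one plausible reason multi-task learning is attractive compared with feeding ASR features in at the input.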
Pages: 8092-8096
Page count: 5
Related papers
50 in total
  • [1] SPEAKER AND LANGUAGE AWARE TRAINING FOR END-TO-END ASR
    Bansal, Shubham
    Malhotra, Karan
    Ganapathy, Sriram
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 494 - 501
  • [2] End-to-end Neural Diarization: From Transformer to Conformer
    Liu, Yi Chieh
    Han, Eunjung
    Lee, Chul
    Stolcke, Andreas
    [J]. INTERSPEECH 2021, 2021, : 3081 - 3085
  • [3] OVERLAP-AWARE DIARIZATION: RESEGMENTATION USING NEURAL END-TO-END OVERLAPPED SPEECH DETECTION
    Bullock, Latane
    Bredin, Herve
    Garcia-Perera, Leibny Paola
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7114 - 7118
  • [4] END-TO-END NEURAL SPEAKER DIARIZATION WITH SELF-ATTENTION
    Fujita, Yusuke
    Kanda, Naoyuki
    Horiguchi, Shota
    Xue, Yawen
    Nagamatsu, Kenji
    Watanabe, Shinji
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 296 - 303
  • [5] End-to-End Audio-Visual Neural Speaker Diarization
    He, Mao-kui
    Du, Jun
    Lee, Chin-Hui
    [J]. INTERSPEECH 2022, 2022, : 1461 - 1465
  • [6] Robust End-to-end Speaker Diarization with Generic Neural Clustering
    Yang, Chenyu
    Wang, Yu
    [J]. INTERSPEECH 2022, 2022, : 1471 - 1475
  • [7] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Thebaud, Thomas
    Dehak, Najim
    Kowalczyk, Konrad
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
  • [8] Spelling-Aware Word-Based End-to-End ASR
    Egorova, Ekaterina
    Vydana, Hari Krishna
    Burget, Lukas
    Cernocky, Jan Honza
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1729 - 1733
  • [9] Encoder-Decoder Based Attractors for End-to-End Neural Diarization
    Horiguchi, Shota
    Fujita, Yusuke
    Watanabe, Shinji
    Xue, Yawen
    Garcia, Paola
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1493 - 1507
  • [10] End-To-End Neural Speaker Diarization Through Step-Function
    Latypov, Rustam
    Stolov, Evgeni
    [J]. 2021 IEEE 15TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2021), 2021,