Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Cited by: 1
Authors
Lin, Yuxiao [1]
Du, Zhihao [2]
Zhang, Shiliang [2]
Yu, Fan [2]
Zhao, Zhou [1]
Wu, Fei [1]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China
[2] Speech Lab, Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
Keywords
speaker-attributed ASR; multi-target speech separation; IDENTIFICATION; EXTRACTION; FEATURES
DOI
10.1109/ISCSLP57327.2022.10037902
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose Separate-to-Recognize, a joint framework for the speaker-attributed automatic speech recognition (SA-ASR) task. The proposed framework combines multi-target speech separation and speech recognition modules into a single end-to-end model. It takes mixed speech utterances and target-speaker embeddings as input and predicts the separated speech and transcription for each speaker. In the multi-target speech separation module, all mixed speakers are separated simultaneously, unlike existing single-target separation methods. Furthermore, we develop a dual-path Conformer-based separator, which improves dual-path time-domain separation by exploiting the Conformer's ability to model local relationships. We also explore different schemes for jointly training the modules and propose a training strategy that better coordinates the two modules in our model. By comparing different model structures and training strategies in experiments, we demonstrate the effectiveness of the proposed multi-target separation module and dual-path Conformer-based separator. Experimental results also show that our framework generalizes to different neural network architectures.
Pages: 150-154
Page count: 5
Related papers
15 in total
  • [1] A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings
    Yu, Fan
    Du, Zhihao
    Zhang, Shiliang
    Lin, Yuxiao
    Xie, Lei
    [J]. INTERSPEECH 2022, 2022, : 560 - 564
  • [2] A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings
    Shi, Mohan
    Zhang, Jie
    Du, Zhihao
    Yu, Fan
    Chen, Qian
    Zhang, Shiliang
    Dai, Li-Rong
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1943 - 1948
  • [3] Two-Stage Multi-Target Joint Learning for Monaural Speech Separation
    Nie, Shuai
    Liang, Shan
    Xue, Wei
    Zhang, Xueliang
    Liu, Wenju
    Dong, Like
    Yang, Hong
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 1503 - 1507
  • [4] Multi-target Ensemble Learning for Monaural Speech Separation
    Zhang, Hui
    Zhang, Xueliang
    Gao, Guanglai
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1958 - 1962
  • [5] Streaming Multi-talker Speech Recognition with Joint Speaker Identification
    Lu, Liang
    Kanda, Naoyuki
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2021, 2021, : 1782 - 1786
  • [6] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
    Tu, Yan-Hui
    Du, Jun
    Dai, Li-Rung
    Lee, Chin-Hui
    [J]. 2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [7] Joint Optimization of Denoising Autoencoder and DNN Acoustic Model Based on Multi-target Learning for Noisy Speech Recognition
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 3803 - 3807
  • [8] A Speech Enhancement Neural Network Architecture with SNR-Progressive Multi-Target Learning for Robust Speech Recognition
    Zhou, Nan
    Du, Jun
    Tu, Yan-Hui
    Gao, Tian
    Lee, Chin-Hui
    [J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 873 - 877
  • [9] PROGRESSIVE MULTI-TARGET NETWORK BASED SPEECH ENHANCEMENT WITH SNR-PRESELECTION FOR ROBUST SPEAKER DIARIZATION
    Sun, Lei
    Du, Jun
    Zhang, Xueyang
    Gao, Tian
    Fang, Xin
    Lee, Chin-Hui
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7099 - 7103
  • [10] DIRECTIONAL ASR: A NEW PARADIGM FOR E2E MULTI-SPEAKER SPEECH RECOGNITION WITH SOURCE LOCALIZATION
    Subramanian, Aswin Shanmugam
    Weng, Chao
    Watanabe, Shinji
    Yu, Meng
    Xu, Yong
    Zhang, Shi-Xiong
    Yu, Dong
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8433 - 8437