Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Cited by: 1

Authors:

Lin, Yuxiao ^{[1]}

Du, Zhihao ^{[2]}

Zhang, Shiliang ^{[2]}

Yu, Fan ^{[2]}

Zhao, Zhou ^{[1]}

Wu, Fei ^{[1]}

Affiliations:

[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China

[2] Speech Lab, Alibaba Grp, Hangzhou, Zhejiang, Peoples R China

Source:

2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2022

Keywords:

speaker-attributed ASR; multi-target speech separation; IDENTIFICATION; EXTRACTION; FEATURES;

DOI:

10.1109/ISCSLP57327.2022.10037902

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we propose Separate-to-Recognize, a joint framework for the speaker-attributed automatic speech recognition (SA-ASR) task. The proposed framework combines multi-target speech separation and speech recognition modules into a single end-to-end model. It takes mixed speech utterances and target-speaker embeddings as input and predicts the separated speech and the transcription for each speaker. In the multi-target speech separation module, all speakers in the mixture are separated simultaneously, unlike existing single-target separation methods. Furthermore, we develop a dual-path Conformer-based separator, which improves dual-path time-domain separation by exploiting the Conformer's ability to model local relationships. We also explore different schemes for jointly training the modules and propose a training strategy that better coordinates the two modules in our model. By comparing different model structures and training strategies in experiments, we demonstrate the effectiveness of the proposed multi-target separation module and the dual-path Conformer-based separator. Experimental results also show that our framework generalizes to different neural network architectures.


Page range: 150 - 154

Page count: 5

15 items in total

[1] A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings
Yu, Fan; Du, Zhihao; Zhang, Shiliang; Lin, Yuxiao; Xie, Lei
[J]. INTERSPEECH 2022, 2022, : 560 - 564
[2] A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings
Shi, Mohan; Zhang, Jie; Du, Zhihao; Yu, Fan; Chen, Qian; Zhang, Shiliang; Dai, Li-Rong
[J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1943 - 1948
[3] Two-Stage Multi-Target Joint Learning for Monaural Speech Separation
Nie, Shuai; Liang, Shan; Xue, Wei; Zhang, Xueliang; Liu, Wenju; Dong, Like; Yang, Hong
[J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 1503 - 1507
[4] Multi-target Ensemble Learning for Monaural Speech Separation
Zhang, Hui; Zhang, Xueliang; Gao, Guanglai
[J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1958 - 1962
[5] Streaming Multi-talker Speech Recognition with Joint Speaker Identification
Lu, Liang; Kanda, Naoyuki; Li, Jinyu; Gong, Yifan
[J]. INTERSPEECH 2021, 2021, : 1782 - 1786
[6] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
Tu, Yan-Hui; Du, Jun; Dai, Li-Rung; Lee, Chin-Hui
[J]. 2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
[7] Joint Optimization of Denoising Autoencoder and DNN Acoustic Model Based on Multi-target Learning for Noisy Speech Recognition
Mimura, Masato; Sakai, Shinsuke; Kawahara, Tatsuya
[J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 3803 - 3807
[8] A Speech Enhancement Neural Network Architecture with SNR-Progressive Multi-Target Learning for Robust Speech Recognition
Zhou, Nan; Du, Jun; Tu, Yan-Hui; Gao, Tian; Lee, Chin-Hui
[J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 873 - 877
[9] PROGRESSIVE MULTI-TARGET NETWORK BASED SPEECH ENHANCEMENT WITH SNR-PRESELECTION FOR ROBUST SPEAKER DIARIZATION
Sun, Lei; Du, Jun; Zhang, Xueyang; Gao, Tian; Fang, Xin; Lee, Chin-Hui
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7099 - 7103
[10] DIRECTIONAL ASR: A NEW PARADIGM FOR E2E MULTI-SPEAKER SPEECH RECOGNITION WITH SOURCE LOCALIZATION
Subramanian, Aswin Shanmugam; Weng, Chao; Watanabe, Shinji; Yu, Meng; Xu, Yong; Zhang, Shi-Xiong; Yu, Dong
[J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8433 - 8437
