Speaker extraction network with attention mechanism for speech dialogue system

Times Cited: 1
Authors
Hao, Yun [1 ]
Wu, Jiaju [1 ]
Huang, Xiangkang [1 ]
Zhang, Zijia [1 ]
Liu, Fei [1 ]
Wu, Qingyao [1 ,2 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] Pazhou Lab, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech dialogue system; Speech separation; Multi-task; Attention; SEPARATION; ENHANCEMENT;
DOI
10.1007/s11761-022-00340-w
CLC Number
TP39 [Applications of Computers];
Discipline Code
081203; 0835;
Abstract
Speech dialogue systems are now widely used in many fields, allowing users to interact and communicate with a system through natural language. In practical settings, however, real dialogue scenes contain third-person background speech and background noise, which seriously degrade the intelligibility of the speech signal and reduce speech recognition performance. To tackle this, we exploit a speech separation method that separates the target speech from complex multi-speaker mixtures. We propose a multi-task attention mechanism and adopt TFCN as our audio feature extraction module. Based on the multi-task method, we jointly train with an SI-SDR loss and a cross-entropy speaker classification loss, and then use the attention mechanism to further exclude the background vocals in the mixed speech. We evaluate our results not only on the distortion metrics SI-SDR and SDR, but also with a speech recognition system. To train our model and demonstrate its effectiveness, we build a background-vocal-removal dataset based on a common dataset. Experimental results show that our model significantly improves the performance of the speech separation model.
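The joint training objective described in the abstract (an SI-SDR separation loss combined with a cross-entropy speaker-classification loss) can be illustrated with a minimal sketch. The function names, tensor shapes, and the weighting factor `alpha` below are illustrative assumptions, not the authors' implementation; the TFCN extractor and attention module are omitted.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR, averaged over the batch; shapes [batch, samples]."""
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled reference signal.
    scale = (estimate * target).sum(dim=-1, keepdim=True) / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    si_sdr = 10 * torch.log10(s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_sdr.mean()

def joint_loss(est_speech, ref_speech, speaker_logits, speaker_labels, alpha: float = 0.5):
    """Multi-task objective: separation quality plus speaker classification.

    `alpha` is a hypothetical weighting between the two losses.
    """
    sep = si_sdr_loss(est_speech, ref_speech)          # separation term (negative SI-SDR)
    cls = F.cross_entropy(speaker_logits, speaker_labels)  # speaker classification term
    return sep + alpha * cls
```

In such a setup, the shared separation network is optimized by both terms, so the speaker-classification branch encourages embeddings that help distinguish the target speaker from background vocals.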
Pages: 111-119
Number of Pages: 9