Speaker extraction network with attention mechanism for speech dialogue system

Cited by: 1

Authors:

Hao, Yun [1]

Wu, Jiaju [1]

Huang, Xiangkang [1]

Zhang, Zijia [1]

Liu, Fei [1]

Wu, Qingyao [1,2]

Affiliations:

[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China

[2] Pazhou Lab, Guangzhou, Peoples R China

Source:

SERVICE ORIENTED COMPUTING AND APPLICATIONS | 2022 / Vol. 16 / Issue 02

Funding:

National Natural Science Foundation of China;

Keywords:

Speech dialogue system; Speech separation; Multi-task; Attention; SEPARATION; ENHANCEMENT;

DOI:

10.1007/s11761-022-00340-w

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

Speech dialogue systems are now widely used in many fields, allowing users to interact and communicate with the system through natural language. In practical dialogue scenes, however, third-party background speech and background noise interfere with the target speech. This interference seriously damages the intelligibility of the speech signal and degrades speech recognition performance. To tackle this, we exploit a speech separation method that separates the target speech from complex multi-speaker speech. We propose a multi-task attention mechanism and select TFCN as our audio feature extraction module. Following the multi-task method, we jointly train with an SI-SDR loss and a cross-entropy speaker classification loss, and then use the attention mechanism to further exclude the background vocals from the mixed speech. We evaluate our results not only with the distortion metrics SI-SDR and SDR but also with a speech recognition system. To train our model and demonstrate its effectiveness, we build a background vocal removal data set based on a common data set. Experimental results show that our model significantly improves the performance of the speech separation model.
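
The joint training objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. It uses the standard definition of SI-SDR (scale-invariant signal-to-distortion ratio) and a plain softmax cross-entropy. The abstract does not say how the two terms are combined, so the weighted sum, the `ce_weight` parameter, and all function names below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (standard definition)."""
    # Remove the mean so a DC offset does not affect the measure.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def cross_entropy(logits, speaker_id):
    """Softmax cross-entropy for a single speaker-classification example."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[speaker_id]


def joint_loss(estimate, target, logits, speaker_id, ce_weight=0.5):
    """Hypothetical multi-task loss: negative SI-SDR plus weighted cross-entropy.

    The weighted-sum form and ce_weight are assumptions, not taken from the paper.
    """
    return -si_sdr(estimate, target) + ce_weight * cross_entropy(logits, speaker_id)


rng = np.random.default_rng(0)
target = rng.standard_normal(16000)                   # stand-in for clean target speech
estimate = target + 0.1 * rng.standard_normal(16000)  # stand-in for a separated output
logits = np.zeros(4)                                  # uniform scores over 4 speakers
loss = joint_loss(estimate, target, logits, speaker_id=2)
```

Because SI-SDR projects the estimate onto the target before measuring the residual, rescaling the estimate leaves the score unchanged, which is why it is commonly preferred over plain SDR as a separation training objective.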

Pages: 111-119

Number of pages: 9
