Speaker extraction network with attention mechanism for speech dialogue system

Cited by: 1
Authors
Hao, Yun [1]
Wu, Jiaju [1]
Huang, Xiangkang [1]
Zhang, Zijia [1]
Liu, Fei [1]
Wu, Qingyao [1, 2]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] Pazhou Lab, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech dialogue system; Speech separation; Multi-task; Attention; SEPARATION; ENHANCEMENT;
DOI
10.1007/s11761-022-00340-w
Chinese Library Classification number
TP39 [Computer applications];
Subject classification code
081203; 0835;
Abstract
Speech dialogue systems are now widely used in various fields, letting users interact and communicate with the system through natural language. In practical situations, however, real dialogue scenes contain third-person background speech and background noise. This interference seriously damages the intelligibility of the speech signal and degrades speech recognition performance. To tackle this, we exploit a speech separation method that separates the target speech from complex multi-speaker speech. We propose a multi-task attention mechanism and select TFCN as our audio feature extraction module. Following the multi-task method, we jointly train with an SI-SDR loss and a cross-entropy speaker classification loss, and then use the attention mechanism to further exclude background vocals from the mixed speech. We evaluate our results not only with the distortion metrics SI-SDR and SDR but also with a speech recognition system. To train our model and demonstrate its effectiveness, we build a background vocal removal data set based on a common data set. Experimental results show that our model significantly improves the performance of the speech separation model.
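The joint objective mentioned in the abstract (an SI-SDR term plus a cross-entropy speaker classification term) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `ce_weight` balancing factor, and the simple weighted-sum combination are assumptions made for illustration, since the abstract does not say how the two losses are weighted. SI-SDR here follows the standard definition (project the zero-mean estimate onto the zero-mean target, then compare the projected energy with the residual energy).

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length signals."""
    n = len(target)
    if n == 0 or len(estimate) != n:
        raise ValueError("signals must be non-empty and of equal length")
    # Remove the mean of each signal.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    est = [x - mean_e for x in estimate]
    tgt = [x - mean_t for x in target]
    # Project the estimate onto the target to get the scaled target part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt)
    alpha = dot / (tgt_energy + eps)
    proj = [alpha * t for t in tgt]
    # Everything not explained by the scaled target counts as distortion.
    resid = [e - p for e, p in zip(est, proj)]
    proj_energy = sum(p * p for p in proj)
    resid_energy = sum(r * r for r in resid)
    return 10.0 * math.log10((proj_energy + eps) / (resid_energy + eps))


def cross_entropy(logits, speaker_index):
    """Softmax cross-entropy for one example, via a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[speaker_index]


def joint_loss(estimate, target, logits, speaker_index, ce_weight=0.5):
    """Hypothetical multi-task loss: negative SI-SDR plus weighted CE.

    ce_weight is an assumed hyperparameter, not a value from the paper.
    """
    return -si_sdr(estimate, target) + ce_weight * cross_entropy(logits, speaker_index)
```

Minimizing the negative SI-SDR rewards waveform fidelity to the target speaker, while the cross-entropy term pushes the network to keep speaker identity recoverable; in a real system both terms would be computed on tensors inside a deep learning framework rather than on Python lists.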
Pages: 111-119
Number of pages: 9
Related papers
50 in total
  • [1] Speaker extraction network with attention mechanism for speech dialogue system
    Yun Hao
    Jiaju Wu
    Xiangkang Huang
    Zijia Zhang
    Fei Liu
    Qingyao Wu
    Service Oriented Computing and Applications, 2022, 16 : 111 - 119
  • [2] Denoi-SpEx+: A Speaker Extraction Network based Speech Dialogue System
    Hao, Yun
    Huang, Xiangkang
    Huang, Huichou
    Wu, Qingyao
    2021 IEEE INTERNATIONAL CONFERENCE ON E-BUSINESS ENGINEERING (ICEBE 2021), 2021, : 49 - 53
  • [3] Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
    Wang, Wupeng
    Xu, Chenglin
    Ge, Meng
    Li, Haizhou
    INTERSPEECH 2021, 2021, : 3535 - 3539
  • [4] SINGLE-CHANNEL SPEECH EXTRACTION USING SPEAKER INVENTORY AND ATTENTION NETWORK
    Xiao, Xiong
    Chen, Zhuo
    Yoshioka, Takuya
    Erdogan, Hakan
    Liu, Changliang
    Dimitriadis, Dimitrios
    Droppo, Jasha
    Gong, Yifan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 86 - 90
  • [5] SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures
    Zmolikova, Katerina
    Delcroix, Marc
    Kinoshita, Keisuke
    Ochiai, Tsubasa
    Nakatani, Tomohiro
    Burget, Lukas
    Cernocky, Jan
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2019, 13 (04) : 800 - 814
  • [6] Target Speaker Extraction with Attention Enhancement and Gated Fusion Mechanism
    Wang, Sijie
    Hamdulla, Askar
    Ablimit, Mijit
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1995 - 2001
  • [7] Speaker Adaptive Training for Speech Recognition Based on Attention-over-Attention Mechanism
    Wan, Genshun
    Pan, Jia
    Wang, Qingran
    Gao, Jianqing
    Ye, Zhongfu
    INTERSPEECH 2020, 2020, : 1251 - 1255
  • [8] Speaker-aware neural network based beamformer for speaker extraction in speech mixtures
    Zmolikova, Katerina
    Delcroix, Marc
    Kinoshita, Keisuke
    Higuchi, Takuya
    Ogawa, Atsunori
    Nakatani, Tomohiro
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2655 - 2659
  • [9] TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition
    Li, Wenjie
    Zhang, Pengyuan
    Yan, Yonghong
    ELECTRONICS LETTERS, 2019, 55 (14) : 816 - 818
  • [10] Hierarchic Temporal Convolutional Network with Attention Fusion for Target Speaker Extraction
    Chen, Zihao
    Qiu, Wenbo
    Xu, Haitao
    Hu, Ying
    PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 827 - 832