Speaker extraction network with attention mechanism for speech dialogue system

Times Cited: 1
Authors
Hao, Yun [1 ]
Wu, Jiaju [1 ]
Huang, Xiangkang [1 ]
Zhang, Zijia [1 ]
Liu, Fei [1 ]
Wu, Qingyao [1 ,2 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] Pazhou Lab, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech dialogue system; Speech separation; Multi-task; Attention; SEPARATION; ENHANCEMENT;
DOI
10.1007/s11761-022-00340-w
CLC Number
TP39 [Applications of Computers];
Discipline Code
081203; 0835;
Abstract
Speech dialogue systems are now widely used in many fields, allowing users to interact and communicate with a system through natural language. In practical settings, however, real dialogue scenes contain third-person background speech and background noise, which seriously degrade the intelligibility of the speech signal and reduce speech recognition performance. To tackle this, we exploit a speech separation method that separates the target speech from complex multi-speaker mixtures. We propose a multi-task attention mechanism and adopt TFCN as our audio feature extraction module. Based on the multi-task method, we jointly train with an SI-SDR loss and a cross-entropy speaker classification loss, and then use the attention mechanism to further exclude the background vocals in the mixed speech. We evaluate our results not only on the distortion metrics SI-SDR and SDR, but also with a speech recognition system. To train our model and demonstrate its effectiveness, we build a background-vocal-removal dataset based on a common dataset. Experimental results show that our model significantly improves the performance of the speech separation model.
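The joint training objective described in the abstract (an SI-SDR separation loss combined with a cross-entropy speaker-classification loss) can be illustrated with a minimal sketch. The function names, tensor shapes, and the weighting factor `alpha` below are illustrative assumptions, not the authors' implementation; the TFCN extractor and attention module are omitted.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR, averaged over the batch; shapes [batch, samples]."""
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled reference signal.
    scale = (estimate * target).sum(dim=-1, keepdim=True) / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    si_sdr = 10 * torch.log10(s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_sdr.mean()

def joint_loss(est_speech, ref_speech, speaker_logits, speaker_labels, alpha: float = 0.5):
    """Multi-task objective: separation quality plus speaker classification.

    `alpha` is a hypothetical weighting between the two losses.
    """
    sep = si_sdr_loss(est_speech, ref_speech)          # separation term (negative SI-SDR)
    cls = F.cross_entropy(speaker_logits, speaker_labels)  # speaker classification term
    return sep + alpha * cls
```

In such a setup, the shared separation network is optimized by both terms, so the speaker-classification branch encourages embeddings that help distinguish the target speaker from background vocals.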
Pages: 111-119
Number of Pages: 9