A two-stage phase-aware approach for monaural multi-talker speech separation

Cited by: 0
Authors
Yin L. [1 ,2 ]
Li J. [1 ,2 ]
Yan Y. [1 ,2 ,3 ]
Akagi M. [4 ]
Affiliations
[1] University of Chinese Academy of Sciences, Beijing
[2] Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing
[3] Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Xinjiang
[4] Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi-shi
Source
IEICE Transactions on Information and Systems
Keywords
Amplitude estimation; Deep learning; Mask estimation; Phase recovery; Speech separation
DOI
10.1587/TRANSINF.2019EDP7259
Abstract
Simultaneous utterances degrade both the listening ability of hearing-impaired persons and the performance of automatic speech recognition systems. Deep neural networks have recently brought dramatic improvements in speech separation, but most previous works estimate only the speech magnitude and reuse the mixture phase for reconstruction; this reliance on the mixture phase has become a critical limitation on separation performance. This study proposes a two-stage phase-aware approach for multi-talker speech separation that jointly recovers the magnitude and the phase. For phase recovery, the Multiple Input Spectrogram Inversion (MISI) algorithm is adopted for its effectiveness and simplicity. The study implements MISI in a mask-based form and shows that the ideal amplitude mask (IAM) is the optimal mask for mask-based MISI phase recovery, introducing the least phase distortion. To compensate for residual phase-recovery error and minimize signal distortion, an advanced mask is proposed for magnitude estimation. The IAM and the proposed mask are estimated at separate stages to recover the phase and the magnitude, respectively. Two neural-network frameworks are evaluated for magnitude estimation in the second stage, demonstrating the effectiveness and flexibility of the proposed approach. Experimental results show that the proposed approach significantly reduces the distortion of the separated speech. Copyright © 2020 The Institute of Electronics, Information and Communication Engineers
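The abstract centers on mask-based MISI phase recovery driven by estimated magnitudes such as IAM-masked mixture spectrograms. As a rough illustration of the general technique (not the paper's exact implementation), the sketch below runs plain MISI iterations with NumPy/SciPy; the function names, `n_iter`, and `nperseg` are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def _fit(sig, n):
    # Trim or zero-pad to length n (STFT/ISTFT round trips can change length).
    return sig[:n] if len(sig) >= n else np.pad(sig, (0, n - len(sig)))

def misi(mixture, est_mags, n_iter=20, nperseg=512):
    """Multiple Input Spectrogram Inversion (MISI), in outline.

    Given the time-domain mixture and one estimated magnitude spectrogram
    per source (same STFT shape as the mixture, e.g. IAM * |STFT(mixture)|),
    iteratively refine each source's phase so the resynthesized sources
    sum back to the mixture. A minimal sketch under assumed hyperparameters.
    """
    n_src, n_len = len(est_mags), len(mixture)
    _, _, X = stft(mixture, nperseg=nperseg)
    phases = [np.angle(X)] * n_src          # initialize with the mixture phase
    for _ in range(n_iter):
        # Resynthesize each source from its fixed magnitude and current phase.
        sigs = [_fit(istft(m * np.exp(1j * p), nperseg=nperseg)[1], n_len)
                for m, p in zip(est_mags, phases)]
        # Distribute the mixture-consistency error equally across sources.
        err = (mixture - np.sum(sigs, axis=0)) / n_src
        # Re-estimate each phase from the error-corrected source signal.
        phases = [np.angle(stft(s + err, nperseg=nperseg)[2]) for s in sigs]
    # Return complex spectrograms; apply istft to obtain time-domain sources.
    return [m * np.exp(1j * p) for m, p in zip(est_mags, phases)]
```

For training-time oracle targets, the IAM for source i is |S_i| / |STFT(mixture)|, so `est_mags` would be the IAM values multiplied back onto the mixture magnitude; at test time a network would predict the masks instead.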
Pages: 1732-1743 (11 pages)
Related papers
50 items in total
  • [31] Influence of competing multi-talker babble on frequency-importance functions for speech measured using a correlational approach
    Gilbert, G
    Micheyl, C
ACTA ACUSTICA UNITED WITH ACUSTICA, 2005, 91(1): 145-154
  • [32] A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech
Tu, Yan-Hui
Du, Jun
Lee, Chin-Hui
JOURNAL OF SIGNAL PROCESSING SYSTEMS, 2018, 90: 963-973
  • [33] A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech
    Tu, Yan-Hui
    Du, Jun
    Lee, Chin-Hui
JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2018, 90(7): 963-973
  • [34] A Two-stage Approach to Speech Bandwidth Extension
    Lin, Ju
    Wang, Yun
    Kalgaonkar, Kaustubh
    Keren, Gil
    Zhang, Didi
    Fuegen, Christian
INTERSPEECH 2021, 2021: 1689-1693
  • [35] Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments
    Luo, Yiyu
    Wang, Jing
    Xu, Liang
    Yang, Lidong
INTERSPEECH 2021, 2021: 1104-1108
  • [36] Multi-talker Speech Recognition Based on Blind Source Separation with Ad hoc Microphone Array Using Smartphones and Cloud Storage
    Ochi, Keiko
    Ono, Nobutaka
    Miyabe, Shigeki
    Makino, Shoji
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016: 3369-3373
  • [37] A two-stage deep learning algorithm for talker-independent speaker separation in reverberant conditions
    Delfarah, Masood
    Liu, Yuzhou
    Wang, DeLiang
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2020, 148(3): 1157-1168
  • [38] A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge
    Sun, Lei
    Du, Jun
    Gao, Tian
    Fang, Yi
    Ma, Feng
    Lee, Chin-Hui
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2019, 13(4): 827-840
  • [39] Two-Stage Monaural Source Separation in Reverberant Room Environments Using Deep Neural Networks
    Sun, Yang
    Wang, Wenwu
    Chambers, Jonathon
    Naqvi, Syed Mohsen
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27(1): 125-139
  • [40] Utterance-level Permutation Invariant Training with Latency-controlled BLSTM for Single-channel Multi-talker Speech Separation
    Huang, Lu
    Cheng, Gaofeng
    Zhang, Pengyuan
    Yang, Yi
    Xu, Shumin
    Sun, Jiasong
2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019: 1256-1261