WHAM!: Extending Speech Separation to Noisy Environments

Cited by: 153
Authors
Wichern, Gordon [1]
Antognini, Joe [2]
Flynn, Michael [2]
Zhu, Licheng Richard [2]
McQuinn, Emmett [2]
Crow, Dwight [2]
Manilow, Ethan [1]
Le Roux, Jonathan [1]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139 USA
[2] Whisper Ai, San Francisco, CA USA
Source
INTERSPEECH 2019
Keywords
source separation; speech enhancement; cocktail party problem; deep clustering; mask inference
DOI
10.21437/Interspeech.2019-2821
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Recent progress in separating the speech signals from multiple overlapping speakers using a single audio channel has brought us closer to solving the cocktail party problem. However, most studies in this area use a constrained problem setup, comparing performance when speakers overlap almost completely, at artificially low sampling rates, and with no external background noise. In this paper, we strive to move the field towards more realistic and challenging scenarios. To that end, we created the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset, consisting of two-speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples. The samples were collected in coffee shops, restaurants, and bars in the San Francisco Bay Area, and are made publicly available. We benchmark various speech separation architectures and objective functions to evaluate their robustness to noise. While separation performance decreases as a result of noise, we still observe substantial gains relative to the noisy signals for most approaches.
Pages: 1368 - 1372
Page count: 5
Related papers
50 in total
  • [31] An Investigation into Audiovisual Speech Correlation in Reverberant Noisy Environments
    Cifani, Simone
    Abel, Andrew
    Hussain, Amir
    Squartini, Stefano
    Piazza, Francesco
    CROSS-MODAL ANALYSIS OF SPEECH, GESTURES, GAZE AND FACIAL EXPRESSIONS, 2009, 5641 : 331 - +
  • [32] SPEECH RECOGNITION IN NOISY ENVIRONMENTS WITH THE AID OF MICROPHONE ARRAYS
    VANCOMPERNOLLE, D
    MA, W
    XIE, F
    VANDIEST, M
    SPEECH COMMUNICATION, 1990, 9 (5-6) : 433 - 442
  • [33] Speech signal modification to increase intelligibility in noisy environments
    Yoo, Sungyub D.
    Boston, J. Robert
    El-Jaroudi, Amro
    Li, Ching-Chung
    Durrant, John D.
    Kovacyk, Kristie
    Shaiman, Susan
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2007, 122 (02) : 1138 - 1149
  • [34] Beamforming microphone arrays for speech acquisition in noisy environments
    Fischer, S
    Simmer, KU
    SPEECH COMMUNICATION, 1996, 20 (3-4) : 215 - 227
  • [35] Transfer Learning for Speech Intelligibility Improvement in Noisy Environments
    Biswas, Ritujoy
    Nathwani, Karan
    Abrol, Vinayak
    INTERSPEECH 2021, 2021 : 176 - 180
  • [36] Speech Emotion Recognition Based on EMD in Noisy Environments
    Chu, Yunyun
    Xiong, Weihua
    Chen, Wei
    ADVANCES IN CIVIL ENGINEERING AND BUILDING MATERIALS III, 2014, 831 : 460 - 464
  • [37] TDOA ESTIMATION OF SPEECH SOURCE IN NOISY REVERBERANT ENVIRONMENTS
    Bu, Suliang
    Zhao, Tuo
    Zhao, Yunxin
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022 : 1059 - 1066
  • [38] Multi-band speech recognition in noisy environments
    Okawa, S
    Bocchieri, E
    Potamianos, A
    PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998 : 641 - 644
  • [39] New adaptive structures for speech enhancement in noisy environments
    Martins, CR
    Piedade, MS
    42ND MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, PROCEEDINGS, VOLS 1 AND 2, 1999 : 241 - 244
  • [40] MULTICHANNEL ONLINE SPEECH DEREVERBERATION UNDER NOISY ENVIRONMENTS
    Togami, Masahito
    2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015 : 1078 - 1082