ON TIME-FREQUENCY MASK ESTIMATION FOR MVDR BEAMFORMING WITH APPLICATION IN ROBUST SPEECH RECOGNITION

Cited by: 0
Authors
Xiao, Xiong [1]
Zhao, Shengkui [2]
Jones, Douglas L. [2]
Chng, Eng Siong [1, 3]
Li, Haizhou [1, 3, 4, 5]
Affiliations
[1] Nanyang Technol Univ, Temasek Labs, Singapore, Singapore
[2] Adv Digital Sci Ctr, Singapore, Singapore
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[4] Natl Univ Singapore, Dept ECE, Singapore, Singapore
[5] ASTAR, Inst Infocomm Res, Singapore, Singapore
Keywords
beamforming; robust speech recognition; time-frequency mask; neural networks; long short-term memory
DOI
Not available
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Acoustic beamforming plays a key role in robust automatic speech recognition (ASR). Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and, in turn, significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that each proposed method improves ASR performance on its own and that the methods are complementary. The combined system achieves a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.
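To make the link between TF masks, SCMs, and MVDR beamforming concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the masks are already available (the RNN that produces them is not modeled), and it uses one common reference-channel form of the MVDR solution, w = (Φ_nn⁻¹ Φ_ss u) / tr(Φ_nn⁻¹ Φ_ss); the abstract does not state which MVDR variant the paper uses. The function names, array layout, and diagonal-loading constant are illustrative choices.

```python
import numpy as np


def masked_scm(stft, mask, eps=1e-10):
    """Mask-weighted spatial covariance matrix per frequency bin.

    stft: complex array, shape (channels, frames, freqs)
    mask: real array in [0, 1], shape (frames, freqs)
    returns: complex array, shape (freqs, channels, channels)
    """
    # Sum over frames of mask * y y^H, computed for every frequency bin.
    scm = np.einsum("tf,ctf,dtf->fcd", mask, stft, stft.conj())
    # Normalize by the total mask weight in each bin (guarded against zero).
    norm = np.maximum(mask.sum(axis=0), eps)
    return scm / norm[:, None, None]


def mvdr_weights(scm_speech, scm_noise, ref_channel=0, diag_load=1e-6, eps=1e-10):
    """Reference-channel MVDR weights, shape (freqs, channels).

    Assumed formulation: w = (Phi_nn^-1 Phi_ss u) / trace(Phi_nn^-1 Phi_ss),
    where u selects the reference microphone.
    """
    n_ch = scm_noise.shape[-1]
    # Diagonal loading (scaled by the average noise power) keeps Phi_nn invertible.
    load = diag_load * np.trace(scm_noise, axis1=-2, axis2=-1).real / n_ch
    scm_noise = scm_noise + load[:, None, None] * np.eye(n_ch)
    # Solve Phi_nn X = Phi_ss for each frequency instead of inverting explicitly.
    numerator = np.linalg.solve(scm_noise, scm_speech)
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    return numerator[..., ref_channel] / (trace[:, None] + eps)


def apply_beamformer(weights, stft):
    """Beamformed single-channel STFT y = w^H x, shape (frames, freqs)."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)
```

In this sketch, the speech and noise SCMs would be obtained by calling `masked_scm` once with a speech mask and once with a noise mask, which mirrors the abstract's point that better masks lead to better SCMs and therefore better beamformer weights.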
Pages: 3246-3250
Number of pages: 5