DEEP BEAMFORMING NETWORKS FOR MULTI-CHANNEL SPEECH RECOGNITION

Cited by: 0
Authors
Xiao, Xiong [1 ]
Watanabe, Shinji [2 ]
Erdogan, Hakan [3 ]
Lu, Liang [4 ]
Hershey, John [2 ]
Seltzer, Michael L. [5 ]
Chen, Guoguo [6 ]
Zhang, Yu [7 ]
Mandel, Michael [8 ]
Yu, Dong [5 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] MERL, Cambridge, MA USA
[3] Sabanci Univ, Istanbul, Turkey
[4] Univ Edinburgh, Edinburgh EH8 9YL, Midlothian, Scotland
[5] Microsoft Res, Redmond, WA USA
[6] Johns Hopkins Univ, Baltimore, MD 21218 USA
[7] MIT, Cambridge, MA 02139 USA
[8] CUNY Brooklyn Coll, Brooklyn, NY 11210 USA
Keywords
microphone arrays; direction of arrival; filter-and-sum beamforming; speech recognition; deep neural networks
DOI
Not available
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Despite the significant progress in speech recognition enabled by deep neural networks, performance remains poor in some scenarios. In this work, we focus on far-field speech recognition, which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing, including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beamformer are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are then extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beamforming and acoustic modeling networks are trained jointly using back-propagation with a common cross-entropy objective function. In experiments on the AMI meeting corpus, we observed improvements from pre-training each sub-network with a network-specific objective function before joint training of both networks. The proposed method achieved a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
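Illustrative sketch (not taken from the paper): the following minimal PyTorch snippet shows one way the two sub-networks and the filter-and-sum operation between them could be wired up and trained jointly with a single cross-entropy loss, as the abstract describes. The class name DeepBeamformingNet, the layer sizes, the spatial input features, and the random placeholder mel filterbank are assumptions made for illustration only.

```python
# A minimal sketch of the pipeline summarized above: one sub-network predicts
# frequency-domain filter-and-sum beamformer weights, the weighted channels are
# summed into an enhanced spectrum, log-mel-like features are computed, and a
# second sub-network performs acoustic modeling. All sizes are illustrative.
import torch
import torch.nn as nn


class DeepBeamformingNet(nn.Module):  # hypothetical class name
    def __init__(self, n_channels=8, n_freq=257, n_mels=40, n_senones=4000):
        super().__init__()
        # Sub-network 1: maps spatial features of the array signals to the
        # real and imaginary parts of one complex weight per channel and bin.
        self.bf_net = nn.Sequential(
            nn.Linear(n_channels * n_freq, 1024), nn.ReLU(),
            nn.Linear(1024, 2 * n_channels * n_freq),
        )
        # Placeholder for a fixed mel filterbank ("conventional" features).
        self.register_buffer("mel_fb", torch.rand(n_freq, n_mels))
        # Sub-network 2: frame-level acoustic model trained with cross-entropy.
        self.am_net = nn.Sequential(
            nn.Linear(n_mels, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_senones),
        )
        self.n_channels, self.n_freq = n_channels, n_freq

    def forward(self, stft, spatial_feats):
        # stft: complex STFT of the array, shape (batch, time, channels, freq)
        # spatial_feats: utterance-level features, shape (batch, channels * freq)
        w = self.bf_net(spatial_feats).view(-1, 2, self.n_channels, self.n_freq)
        w = torch.complex(w[:, 0], w[:, 1])              # (batch, channels, freq)
        # Filter-and-sum: weight each channel per frequency bin, then sum.
        enhanced = (stft * w.unsqueeze(1)).sum(dim=2)    # (batch, time, freq)
        feats = torch.log((enhanced.abs() ** 2) @ self.mel_fb + 1e-6)
        return self.am_net(feats)                        # senone logits per frame


# Joint training with a common cross-entropy objective (random toy data):
model = DeepBeamformingNet()
stft = torch.randn(2, 100, 8, 257, dtype=torch.cfloat)
spatial = torch.randn(2, 8 * 257)
labels = torch.randint(0, 4000, (2, 100))
loss = nn.CrossEntropyLoss()(model(stft, spatial).transpose(1, 2), labels)
loss.backward()  # gradients reach both the beamforming and acoustic-model nets
```

In this sketch, pre-training would amount to optimizing bf_net and am_net separately with their own objectives before the joint backward pass shown at the end.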
Pages: 5745-5749
Page count: 5
Related papers
50 in total
  • [1] SPEAKER ADAPTED BEAMFORMING FOR MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION
    Menne, Tobias
    Schlueter, Ralf
    Ney, Hermann
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018: 535-541
  • [2] Factorized MVDR Deep Beamforming for Multi-Channel Speech Enhancement
    Kim, Hansol
    Kang, Kyeongmuk
    Shin, Jong Won
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29: 1898-1902
  • [3] Quaternion Neural Networks for Multi-channel Distant Speech Recognition
    Qiu, Xinchi
    Parcollet, Titouan
    Ravanelli, Mirco
    Lane, Nicholas D.
    Morchid, Mohamed
    INTERSPEECH 2020, 2020: 329-333
  • [4] MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET
    Kong, Yuxiang
    Wu, Jian
    Wang, Quandong
    Gao, Peng
    Zhuang, Weiji
    Wang, Yujun
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021: 104-110
  • [5] Combined Multi-channel NMF-based Robust Beamforming for Noisy Speech Recognition
    Mimura, Masato
    Bando, Yoshiaki
    Shimada, Kazuki
    Sakai, Shinsuke
    Yoshii, Kazuyoshi
    Kawahara, Tatsuya
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 2451-2455
  • [6] Multi-Channel Transformer Transducer for Speech Recognition
    Chang, Feng-Ju
    Radfar, Martin
    Mouchtaris, Athanasios
    Omologo, Maurizio
    [J]. INTERSPEECH 2021, 2021, : 296 - 300
  • [7] Multi-channel Speech Separation Using Deep Embedding With Multilayer Bootstrap Networks
    Yang, Ziye
    Zhang, Xiao-Lei
    Fu, Zhonghua
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020: 716-719
  • [8] An iterative mask estimation approach to deep learning based multi-channel speech recognition
    Tu, Yan-Hui
    Du, Jun
    Sun, Lei
    Ma, Feng
    Wang, Hai-Kun
    Chen, Jing-Dong
    Lee, Chin-Hui
    SPEECH COMMUNICATION, 2019, 106: 31-43
  • [9] Multi-channel sub-band speech recognition
    McCowan, I. A.
    Sridharan, S.
    EURASIP Journal on Advances in Signal Processing, 2001, (1): 45-52
  • [10] Multi-Channel Feature Adaptation for Robust Speech Recognition
    Zhang, Zhaofeng
    Xiao, Xiong
    Wang, Longbiao
    Dang, Jianwu
    Iwahashi, Masahiro
    Chng, Eng Siong
    Li, Haizhou
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016