DEEP BEAMFORMING NETWORKS FOR MULTI-CHANNEL SPEECH RECOGNITION

Cited: 0
Authors
Xiao, Xiong [1 ]
Watanabe, Shinji [2 ]
Erdogan, Hakan [3 ]
Lu, Liang [4 ]
Hershey, John [2 ]
Seltzer, Michael L. [5 ]
Chen, Guoguo [6 ]
Zhang, Yu [7 ]
Mandel, Michael [8 ]
Yu, Dong [5 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] MERL, Cambridge, MA USA
[3] Sabanci Univ, Istanbul, Turkey
[4] Univ Edinburgh, Edinburgh EH8 9YL, Midlothian, Scotland
[5] Microsoft Res, Redmond, WA USA
[6] Johns Hopkins Univ, Baltimore, MD 21218 USA
[7] MIT, Cambridge, MA 02139 USA
[8] CUNY Brooklyn Coll, Brooklyn, NY 11210 USA
Keywords
microphone arrays; direction of arrival; filter-and-sum beamforming; speech recognition; deep neural networks
DOI: not available
Chinese Library Classification: O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
Despite the significant progress in speech recognition enabled by deep neural networks, poor performance persists in some scenarios. In this work, we focus on far-field speech recognition, which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing, including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beamformer are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beamforming and acoustic modeling networks are trained jointly using back-propagation with a common cross-entropy objective function. In experiments on the AMI meeting corpus, we observed improvements from pre-training each sub-network with a network-specific objective function before joint training. The proposed method obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
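The filter-and-sum step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes complex per-channel, per-frequency weights applied to multi-channel STFT frames, and the function and variable names are illustrative only.

```python
import numpy as np

def filter_and_sum(stft_channels, weights):
    """Frequency-domain filter-and-sum beamforming (illustrative sketch).

    stft_channels: complex array, shape (channels, frames, freq_bins)
    weights: complex filter coefficients, shape (channels, freq_bins)
    Returns the enhanced STFT, shape (frames, freq_bins):
        Y[t, f] = sum_c weights[c, f] * X[c, t, f]
    """
    # Multiply each channel by its per-frequency filter, then sum over channels.
    return np.einsum('cf,ctf->tf', weights, stft_channels)

# Toy example: 2 channels, 3 frames, 4 frequency bins.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4)) + 1j * rng.standard_normal((2, 3, 4))
w = np.full((2, 4), 0.5 + 0j)  # uniform weights -> simple channel averaging
Y = filter_and_sum(X, w)
```

In the framework the abstract describes, the weights would be predicted by the first network from features of the microphone channels rather than fixed as here; the enhanced STFT would then feed conventional feature extraction and the acoustic modeling network.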
Pages: 5745-5749 (5 pages)