MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET

Cited by: 8
Authors
Kong, Yuxiang [1, 2]
Wu, Jian [1]
Wang, Quandong [2]
Gao, Peng [2]
Zhuang, Weiji [2]
Wang, Yujun [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Xiaomi Inc, Beijing, Peoples R China
Keywords
Multi-channel speech recognition; robust speech recognition; deep learning; deep complex Unet
DOI
10.1109/SLT48900.2021.9383492
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvements over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet), a powerful complex-valued Unet-structured speech enhancement model, as the front-end of the multi-channel acoustic model, and to integrate the two in a multi-task learning (MTL) framework, with a cascaded framework for comparison. We also investigate the proposed methods with several training strategies to improve the recognition accuracy on 1000 hours of real-world XiaoMi smart speaker data with echoes. Experiments show that the proposed DCUnet-MTL method yields about a 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.
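The metric quoted in the abstract, relative CER reduction, can be illustrated with a minimal sketch. The helper names below are hypothetical, and the CER values in the usage note are made-up placeholders chosen only to show the arithmetic; they are not figures reported in the paper, which states only the relative reduction.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, proposed_cer: float) -> float:
    """Relative CER reduction of a proposed system over a baseline."""
    return (baseline_cer - proposed_cer) / baseline_cer


# Hypothetical numbers for illustration only (not taken from the paper).
example = relative_reduction(baseline_cer=10.0, proposed_cer=8.78)
```

Under these assumed values, a baseline CER of 10.0% and a proposed-system CER of 8.78% correspond to a relative reduction of roughly 12.2%, while the absolute reduction is only 1.22 points.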
Pages: 104 - 110
Number of pages: 7