MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET

Cited by: 8
Authors
Kong, Yuxiang [1, 2]
Wu, Jian [1]
Wang, Quandong [2]
Gao, Peng [2]
Zhuang, Weiji [2]
Wang, Yujun [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Xiaomi Inc, Beijing, Peoples R China
Keywords
Multi-channel speech recognition; robust speech recognition; deep learning; deep complex unet;
DOI
10.1109/SLT48900.2021.9383492
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvements over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet), a powerful complex-valued Unet-structured speech enhancement model, as the front-end of the multi-channel acoustic model, and to integrate the two in a multi-task learning (MTL) framework, with a cascaded framework for comparison. We also investigate several training strategies for the proposed methods to improve recognition accuracy on 1000 hours of real-world XiaoMi smart speaker data with echoes. Experiments show that the proposed DCUnet-MTL method yields about a 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.
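The relative CER reduction quoted in the abstract is a ratio of error rates, not a difference in percentage points. The following is a minimal Python sketch of how such a figure is commonly computed, assuming CER is the character-level edit distance divided by the reference length. The function names and the toy inputs are illustrative assumptions and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, proposed_cer: float) -> float:
    """Relative CER reduction of a proposed system over a baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - proposed_cer) / baseline_cer


if __name__ == "__main__":
    # Toy values for illustration only; they are not results from the paper.
    print(cer("abcd", "abxd"))
    print(relative_reduction(0.20, 0.15))
```

Under this definition, a system that lowers CER from a hypothetical 20% to 15% has a 25% relative reduction, although the absolute difference is only 5 percentage points.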
Pages: 104-110
Page count: 7
Related papers
50 in total
  • [31] Quaternion Neural Networks for Multi-channel Distant Speech Recognition
    Qiu, Xinchi
    Parcollet, Titouan
    Ravanelli, Mirco
    Lane, Nicholas D.
    Morchid, Mohamed
    INTERSPEECH 2020, 2020, : 329 - 333
  • [32] Multi-channel Opus compression for far-field automatic speech recognition with a fixed bitrate budget
    Drude, Lukas
    Heymann, Jahn
    Schwarz, Andreas
    Valin, Jean-Marc
    INTERSPEECH 2021, 2021, : 1669 - 1673
  • [33] THE FOSAFER SYSTEM FOR THE ICASSP2024 IN-CAR MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION CHALLENGE
    Huang, Shangkun
    Du, Yuxuan
    Wang, Yankai
    Deng, Jing
    Zheng, Rong
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 5 - 6
  • [34] A GENERATIVE-DISCRIMINATIVE HYBRID APPROACH TO MULTI-CHANNEL NOISE REDUCTION FOR ROBUST AUTOMATIC SPEECH RECOGNITION
    Meutzner, Hendrik
    Araki, Shoko
    Fujimoto, Masakiyo
    Nakatani, Tomohiro
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 5740 - 5744
  • [35] MULTI-CHANNEL OVERLAPPED SPEECH RECOGNITION WITH LOCATION GUIDED SPEECH EXTRACTION NETWORK
    Chen, Zhuo
    Xiao, Xiong
    Yoshioka, Takuya
    Erdogan, Hakan
    Li, Jinyu
    Gong, Yifan
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 558 - 565
  • [36] A Spatiotemporal Multi-Channel Learning Framework for Automatic Modulation Recognition
    Xu, Jialang
    Luo, Chunbo
    Parr, Gerard
    Luo, Yang
    IEEE WIRELESS COMMUNICATIONS LETTERS, 2020, 9 (10) : 1629 - 1632
  • [37] Robust Speech Recognition Using Feature-domain Multi-channel Bayesian Estimators
    Principi, Emanuele
    Rotili, Rudy
    Cifani, Simone
    Marinelli, Lorenzo
    Squartini, Stefano
    Piazza, Francesco
    2010 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, 2010, : 2670 - 2673
  • [38] Factorized MVDR Deep Beamforming for Multi-Channel Speech Enhancement
    Kim, Hansol
    Kang, Kyeongmuk
    Shin, Jong Won
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1898 - 1902
  • [39] CONSISTENCY-AWARE MULTI-CHANNEL SPEECH ENHANCEMENT USING DEEP NEURAL NETWORKS
    Masuyama, Yoshiki
    Togami, Masahito
    Komatsu, Tatsuya
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 821 - 825
  • [40] A unified network for multi-speaker speech recognition with multi-channel recordings
    Liu, Conggui
    Inoue, Nakamasa
    Shinoda, Koichi
    2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC 2017), 2017, : 1304 - 1307