Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition

Cited: 4
Authors
Hwang, Jung-Wook [1 ]
Park, Jeongkyun [2 ]
Park, Rae-Hong [1 ,3 ]
Park, Hyung-Min [1 ]
Affiliations
[1] Sogang Univ, Dept Elect Engn, Seoul 04107, South Korea
[2] Sogang Univ, Dept Artificial Intelligence, Seoul 04107, South Korea
[3] Sogang Univ, ICT Convergence Disaster Safety Res Inst, Seoul 04107, South Korea
Funding
National Research Foundation, Singapore
Keywords
Audio-visual speech recognition; Audio-visual speech enhancement; Deep learning; Joint training; Conformer; Robust speech recognition; DEREVERBERATION; NOISE;
DOI
10.1016/j.apacoust.2023.109478
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Visual features are attractive cues for robust automatic speech recognition (ASR). In particular, in acoustically unfavorable environments, speech recognition performance can be improved by combining audio with visual information obtained from the speaker's face rather than using audio alone. For this reason, various audio-visual speech recognition (AVSR) models have recently been studied. However, experimental results from these AVSR models show that the information most important for speech recognition is concentrated in the audio signal, while visual information mainly improves robustness when the audio is corrupted in noisy environments. Consequently, the recognition performance of conventional AVSR models in noisy environments can be improved only to a limited extent. Unlike conventional AVSR models that use the input audio-visual information directly, in this paper we propose an AVSR model that first performs audio-visual speech enhancement (AVSE) to enhance the target speech based on audio-visual information and then uses both the audio enhanced by the AVSE and visual information such as the speaker's lips or face. Specifically, we propose a deep AVSR model trained end-to-end as a single model by integrating a Conformer-based AVSR model with hybrid decoding and an AVSE model based on the U-Net with recurrent neural network (RNN) attention (RA). Experimental results on the LRS2-BBC and LRS3-TED datasets demonstrate that the AVSE model effectively suppresses corrupting noise and the AVSR model achieves noise robustness. In particular, the proposed jointly trained model, which integrates the AVSE and AVSR stages into a single model, showed better recognition performance than the other compared methods. © 2023 Elsevier Ltd. All rights reserved.
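The abstract describes the architecture only at a high level; the following is a minimal PyTorch sketch of the joint-training idea, not the authors' implementation. The paper's U-Net/RA enhancement stage and Conformer recognizer with hybrid decoding are replaced by simple stand-in modules, and a single weighted objective combines an enhancement loss with a CTC recognition loss so that gradients flow through both stages. All module choices, tensor shapes, the mask-based enhancement, and the weight alpha are illustrative assumptions.

```python
# Sketch of joint AVSE + AVSR training as one end-to-end model.
# Stage internals are simplified stand-ins, not the paper's networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAVSEAVSR(nn.Module):
    def __init__(self, n_freq=257, vis_dim=512, hid=256, n_tokens=1000):
        super().__init__()
        # AVSE stage: stand-in for the U-Net with RNN attention (RA);
        # here, a mask estimator over noisy magnitude spectra.
        self.avse = nn.Sequential(
            nn.Linear(n_freq + vis_dim, hid), nn.ReLU(),
            nn.Linear(hid, n_freq), nn.Sigmoid())
        # AVSR stage: stand-in for the Conformer with hybrid decoding;
        # here, a BiLSTM encoder with a CTC output head.
        self.encoder = nn.LSTM(n_freq + vis_dim, hid,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hid, n_tokens)

    def forward(self, noisy_mag, visual):
        # Assumes visual features are frame-synchronized with the audio.
        # Enhance: predict a time-frequency mask from audio-visual input.
        mask = self.avse(torch.cat([noisy_mag, visual], dim=-1))
        enhanced = mask * noisy_mag
        # Recognize from the enhanced audio plus the visual stream.
        h, _ = self.encoder(torch.cat([enhanced, visual], dim=-1))
        return enhanced, self.ctc_head(h).log_softmax(-1)

def joint_loss(model, noisy_mag, visual, clean_mag, tokens,
               feat_lens, tok_lens, alpha=0.5):
    enhanced, log_probs = model(noisy_mag, visual)
    l_enh = F.mse_loss(enhanced, clean_mag)               # AVSE loss
    l_asr = F.ctc_loss(log_probs.transpose(0, 1),         # AVSR loss
                       tokens, feat_lens, tok_lens)
    # Single objective: gradients reach both stages in one backward pass.
    return alpha * l_enh + (1 - alpha) * l_asr
```

The point mirrored from the abstract is the training setup, not the specific modules: the enhancement and recognition stages are optimized together under one objective rather than trained separately and cascaded.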
Pages: 8