Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

Cited by: 5
Authors
Liu, Hong [1 ]
Li, Wenhao [1 ]
Yang, Bing [1 ]
Affiliations
[1] Peking Univ, Key Lab Machine Percept, Shenzhen Grad Sch, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Audio-Visual Fusion; Robust Speech Recognition; Multi-modality; Hybrid Fusion;
DOI
10.1109/ICPR48806.2021.9412817
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The fusion of audio and visual modalities is an important stage of audio-visual speech recognition (AVSR), which is generally approached through feature fusion or decision fusion. Feature fusion can exploit the covariations between features from different modalities effectively, whereas decision fusion shows the robustness of capturing an optimal combination of multi-modality. In this work, to take full advantage of the complementarity of the two fusion strategies and address the challenge of inherent ambiguity in noisy environments, we propose a novel hybrid fusion based AVSR method with residual networks and Bidirectional Gated Recurrent Unit (BGRU), which is able to distinguish homophones in both clean and noisy conditions. Specifically, a simple yet effective audio-visual encoder is used to map audio and visual features into a shared latent space to capture more discriminative multi-modal feature and find the internal correlation between spatial-temporal information for different modalities. Furthermore, a decision fusion module is designed to get final predictions in order to robustly utilize the reliability measures of audio-visual information. Finally, we introduce a combined loss, which shows its noise-robustness in learning the joint representation across various modalities. Experimental results on the largest publicly available dataset (LRW) demonstrate the robustness of the proposed method under various noisy conditions.
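The general idea of combining feature fusion with decision fusion can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record gives no architectural details beyond the abstract, so the three-stream layout (audio-only, visual-only, and a joint stream computed from fused features), the function names, the fixed reliability weights, and the toy logits are all assumptions made here for illustration. The residual networks, the BGRU, and the combined loss from the abstract are not modeled.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def feature_fusion(audio_feat, visual_feat):
    """Feature-level fusion (assumed here to be plain concatenation)."""
    return list(audio_feat) + list(visual_feat)


def decision_fusion(posteriors, reliabilities):
    """Decision-level fusion: reliability-weighted average of posteriors.

    `reliabilities` are hypothetical non-negative stream weights; they are
    normalized to sum to 1 so the result is still a distribution.
    """
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    n_classes = len(posteriors[0])
    return [
        sum(w * p[i] for w, p in zip(weights, posteriors))
        for i in range(n_classes)
    ]


def hybrid_fusion(audio_logits, visual_logits, joint_logits, reliabilities):
    """Combine audio-only, visual-only, and joint-stream predictions.

    `joint_logits` stands for the output of a classifier applied to the
    fused features; that classifier is outside this sketch.
    """
    posteriors = [
        softmax(audio_logits),
        softmax(visual_logits),
        softmax(joint_logits),
    ]
    return decision_fusion(posteriors, reliabilities)


# Toy two-class example: the audio stream is ambiguous (as it might be in
# noise), while the visual and joint streams favor class 1.
fused_features = feature_fusion([0.1, 0.2], [0.3])
fused = hybrid_fusion(
    audio_logits=[1.0, 1.0],
    visual_logits=[0.0, 2.0],
    joint_logits=[0.5, 1.5],
    reliabilities=[0.2, 0.4, 0.4],
)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the ambiguous audio stream receives a lower weight, so the fused prediction follows the visual and joint streams. In a trained system the weights would typically be learned or estimated from reliability measures rather than fixed by hand.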
Pages: 7580-7586
Page count: 7