Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

Cited by: 5

Authors:

Liu, Hong [1]

Li, Wenhao [1]

Yang, Bing [1]

Affiliations:

[1] Peking Univ, Key Lab Machine Percept, Shenzhen Grad Sch, Shenzhen, Peoples R China

Source:

2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2021

Funding:

National Natural Science Foundation of China;

Keywords:

Audio-Visual Fusion; Robust Speech Recognition; Multi-modality; Hybrid Fusion;

DOI:

10.1109/ICPR48806.2021.9412817

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The fusion of audio and visual modalities is an important stage of audio-visual speech recognition (AVSR) and is generally approached through feature fusion or decision fusion. Feature fusion can effectively exploit the covariations between features from different modalities, whereas decision fusion is robust in finding an optimal combination of the modalities. In this work, to take full advantage of the complementarity of the two fusion strategies and to address the challenge of inherent ambiguity in noisy environments, we propose a novel hybrid-fusion-based AVSR method with residual networks and a Bidirectional Gated Recurrent Unit (BGRU), which is able to distinguish homophones in both clean and noisy conditions. Specifically, a simple yet effective audio-visual encoder maps audio and visual features into a shared latent space to capture more discriminative multi-modal features and to find the internal correlation between the spatial-temporal information of the different modalities. Furthermore, a decision fusion module is designed to produce the final predictions so as to robustly exploit the reliability measures of the audio-visual information. Finally, we introduce a combined loss that is noise-robust in learning the joint representation across modalities. Experimental results on the largest publicly available dataset (LRW) demonstrate the robustness of the proposed method under various noisy conditions.
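The distinction the abstract draws between feature fusion, decision fusion, and their hybrid can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses residual networks and a BGRU, whereas the plain linear classifiers, the function names (`feature_fusion_logits`, `decision_fusion`, `hybrid_predict`), and the externally supplied reliability weights below are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Matrix-vector product: one output logit per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def feature_fusion_logits(audio_feat, visual_feat, w_joint) -> List[float]:
    """Feature fusion: concatenate modality features, then classify jointly.
    (Stand-in for a learned shared latent space; hypothetical.)"""
    joint = list(audio_feat) + list(visual_feat)
    return linear(w_joint, joint)


def decision_fusion(posteriors, reliabilities) -> List[float]:
    """Decision fusion: reliability-weighted average of class posteriors."""
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    n_classes = len(posteriors[0])
    return [
        sum(w * p[k] for w, p in zip(weights, posteriors))
        for k in range(n_classes)
    ]


def hybrid_predict(audio_feat, visual_feat, w_audio, w_visual, w_joint,
                   reliabilities) -> List[float]:
    """Hybrid fusion (illustrative): combine the audio-only, visual-only and
    feature-fused posteriors with assumed reliability weights."""
    p_audio = softmax(linear(w_audio, audio_feat))
    p_visual = softmax(linear(w_visual, visual_feat))
    p_joint = softmax(feature_fusion_logits(audio_feat, visual_feat, w_joint))
    return decision_fusion([p_audio, p_visual, p_joint], reliabilities)
```

In this sketch, lowering the audio reliability weight (for example under heavy acoustic noise) shifts the final posterior toward the visual and joint streams, which is the intuition behind using reliability measures at the decision level.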

Pages: 7580-7586

Page count: 7
