Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning

Cited by: 0
Authors
Chen, Chen [1 ]
Hu, Yuchen [1 ]
Zhang, Qiang [2 ,3 ]
Zou, Heqing [1 ]
Zhu, Beier [1 ]
Chng, Eng Siong [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China
Funding
National Research Foundation of Singapore
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noise.
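The abstract's key design choice is a reward tied directly to the task metric, word error rate. A minimal sketch of such a metric-derived reward is shown below; the function names and the specific reward shape (1 − WER) are illustrative assumptions, not the authors' actual implementation.

```python
def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein edit distance between word sequences,
    normalized by reference length (the standard WER definition)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting all reference words
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting all hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[m][n] / max(m, 1)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward: higher as WER drops; a perfect hypothesis scores 1.0."""
    return 1.0 - word_error_rate(reference.split(), hypothesis.split())
```

Because the reward is computed from the decoded hypothesis rather than from a differentiable loss, it can favor integration strategies that directly lower WER, which is the motivation the abstract gives for the RL formulation.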
Pages: 12607+
Page count: 10