Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning

Cited by: 0

Authors:

Chen, Chen ^{[1]}

Hu, Yuchen ^{[1]}

Zhang, Qiang ^{[2,3]}

Zou, Heqing ^{[1]}

Zhu, Beier ^{[1]}

Chng, Eng Siong ^{[1]}

Affiliations:

[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore

[2] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou, Peoples R China

[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China

Source:

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11 | 2023

Funding:

National Research Foundation, Singapore;

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To address this, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises.
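
The abstract ties the reward to word error rate (WER) but this record does not give the exact formulation. As a rough illustration only, the sketch below computes WER with a word-level edit distance and uses its negative as a reward, so that a lower error yields a higher reward. The function names `word_error_rate` and `wer_reward`, and the choice of negative WER as the reward, are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward: negative WER (an assumption, not the paper's definition)."""
    return -word_error_rate(reference, hypothesis)
```

Under this assumed form, a perfect transcription receives the maximum reward of 0, and each additional substitution, deletion, or insertion lowers it in proportion to the reference length.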

Pages: 12607 / +

Number of pages: 10

50 entries in total

[31] Speaker independent audio-visual speech recognition
Zhang, Y
Levinson, S
Huang, T
2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
[32] An asynchronous DBN for audio-visual speech recognition
Saenko, Kate
Livescu, Karen
2006 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 2006, : 154 - +
[33] Audio-visual speech recognition, one pass learning with spiking neurons
Séguier, R
Mercier, D
ARTIFICIAL NEURAL NETWORKS - ICANN 2002, 2002, 2415 : 1207 - 1212
[34] Audio-visual modeling for bimodal speech recognition
Kaynak, MN
Zhi, Q
Cheok, AD
Sengupta, K
Chung, KC
2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 181 - 186
[35] Bimodal fusion in audio-visual speech recognition
Zhang, XZ
Mersereau, RM
Clements, M
2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS, 2002, : 964 - 967
[36] Audio-Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model
Lee, Yong-Hyeok
Jang, Dong-Won
Kim, Jae-Bin
Park, Rae-Hong
Park, Hyung-Min
APPLIED SCIENCES-BASEL, 2020, 10 (20): : 1 - 18
[37] AUDIO-VISUAL SPEECH INPAINTING WITH DEEP LEARNING
Morrone, Giovanni
Michelsanti, Daniel
Tan, Zheng-Hua
Jensen, Jesper
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6653 - 6657
[38] Deep Learning-Based Audio-Visual Speech Recognition for Bosnian Digits
Fazlic, Husein
Abd Almisre, Ali
Tahir, Nooritawati Md
JURNAL KEJURUTERAAN, 2024, 36 (01): : 147 - 154
[39] Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach
Miao, Yajie
Metze, Florian
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 3414 - 3418
[40] Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
Yang, Chih-Chun
Fan, Wan-Cyuan
Yang, Cheng-Fu
Wang, Yu-Chiang Frank
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3036 - 3044