Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning

Times cited: 0
Authors
Chen, Chen [1]
Hu, Yuchen [1]
Zhang, Qiang [2, 3]
Zou, Heqing [1]
Zhu, Beier [1]
Chng, Eng Siong [1]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China
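Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11 | 2023
Funding
National Research Foundation, Singapore;
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract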
Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, which is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To address this, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during auto-regressive decoding. We customize a reward function directly related to the task-specific metric (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that MSRL generalizes better than other baselines when the test set contains unseen noises.
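The abstract ties the reward to word error rate (WER) but does not give the reward's exact form. As a minimal sketch only, the following computes the standard WER (word-level edit distance divided by the reference length) and turns it into a reward by negation. The function names `word_error_rate` and `wer_reward`, and the choice of plain negation, are assumptions for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward (an assumption, not the paper's): lower WER gives higher reward."""
    return -word_error_rate(reference, hypothesis)
```

Under this sketch a perfect transcript receives a reward of 0 and every word error lowers it; note that WER can exceed 1 when the hypothesis contains many inserted words, so the reward is not bounded below by -1.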
Pages: 12607 / +
Page count: 10