Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning

Cited by: 0
Authors
Chen, Chen [1 ]
Hu, Yuchen [1 ]
Zhang, Qiang [2 ,3 ]
Zou, Heqing [1 ]
Zhu, Beier [1 ]
Chng, Eng Siong [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China
Funding
National Research Foundation of Singapore
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noise.
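The abstract's key design choice is a reward tied directly to the task metric, word error rate. A minimal sketch of such a metric-derived reward is shown below; the function names and the specific reward shape (1 − WER) are illustrative assumptions, not the authors' actual implementation.

```python
def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein edit distance between word sequences,
    normalized by reference length (the standard WER definition)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting all reference words
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting all hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[m][n] / max(m, 1)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward: higher as WER drops; a perfect hypothesis scores 1.0."""
    return 1.0 - word_error_rate(reference.split(), hypothesis.split())
```

Because the reward is computed from the decoded hypothesis rather than from a differentiable loss, it can favor integration strategies that directly lower WER, which is the motivation the abstract gives for the RL formulation.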
Pages: 12607+
Page count: 10