Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning

Cited by: 0

Authors:

Chen, Chen [1]

Hu, Yuchen [1]

Zhang, Qiang [2,3]

Zou, Heqing [1]

Zhu, Beier [1]

Chng, Eng Siong [1]

Affiliations:

[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore

[2] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou, Peoples R China

[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China

Source:

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11 | 2023

Funding:

National Research Foundation, Singapore

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To address this, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises.
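The abstract ties the reward to word error rate (WER) but does not give its exact form. As a rough illustration only, the sketch below computes WER as the word-level edit distance divided by the reference length, and defines a hypothetical reward as its negative. The function names `word_error_rate` and `wer_reward` and the negative-WER form are assumptions for illustration, not the paper's formulation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward (an assumption): negative WER, so fewer errors score higher."""
    return -word_error_rate(reference, hypothesis)
```

Under this assumed form, a hypothesis with one substituted word out of three reference words has a WER of 1/3 and a reward of -1/3, while an exact match has a reward of zero.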


Pages: 12607 / +

Page count: 10

50 items in total

[1] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
Zhang, Zi-Qiang
Zhang, Jie
Zhang, Jian-Shu
Wu, Ming-Hui
Fang, Xin
Dai, Li-Rong
2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
[2] INFANTS PERCEPTUAL DIFFERENTIATION OF AMODAL AND MODALITY-SPECIFIC AUDIO-VISUAL RELATIONS
BAHRICK, LE
JOURNAL OF EXPERIMENTAL CHILD PSYCHOLOGY, 1992, 53 (02) : 180 - 199
[3] Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Pan, Xichen
Chen, Peiyu
Gong, Yichen
Zhou, Helong
Wang, Xinbing
Lin, Zhouhan
PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4491 - 4503
[4] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
Mroueh, Youssef
Marcheret, Etienne
Goel, Vaibhava
2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
[5] Audio-visual speech recognition using deep learning
Noda, Kuniaki
Yamaguchi, Yuki
Nakadai, Kazuhiro
Okuno, Hiroshi G.
Ogata, Tetsuya
APPLIED INTELLIGENCE, 2015, 42 (04) : 722 - 737
[6] Audio-visual speech recognition using deep learning
Kuniaki Noda
Yuki Yamaguchi
Kazuhiro Nakadai
Hiroshi G. Okuno
Tetsuya Ogata
Applied Intelligence, 2015, 42 : 722 - 737
[7] MODALITY ATTENTION FOR END-TO-END AUDIO-VISUAL SPEECH RECOGNITION
Zhou, Pan
Yang, Wenwen
Chen, Wei
Wang, Yanfeng
Jia, Jia
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6565 - 6569
[8] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
Hwang, Jung-Wook
Park, Jeongkyun
Park, Rae-Hong
Park, Hyung-Min
APPLIED ACOUSTICS, 2023, 211
[9] An audio-visual speech recognition with a new mandarin audio-visual database
Liao, Wen-Yuan
Pao, Tsang-Long
Chen, Yu-Te
Chang, Tsun-Wei
INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007, : 19 - +
[10] Audio-Visual Biometric Recognition Via Joint Sparse Representations
Primorac, Rudi
Togneri, Roberto
Bennamoun, Mohammed
Sohel, Ferdous
2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 3031 - 3035
