Noise-Tolerant Learning for Audio-Visual Action Recognition

Cited by: 0
Authors
Han, Haochen [1 ,3 ]
Zheng, Qinghua [1 ,3 ]
Luo, Minnan [1 ,3 ]
Miao, Kaiyao [2 ,3 ,4 ]
Tian, Feng [1 ,3 ]
Chen, Yan [1 ,3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Natl Engn Lab Big Data Analyt, Xian 710049, Peoples R China
[2] Xi An Jiao Tong Univ, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xian 710049, Peoples R China
[4] Xi An Jiao Tong Univ, Sch Cyber Sci & Engn, Xian 710049, Peoples R China
Keywords
Action recognition; audio-visual learning; noisy labels; noisy correspondence
DOI
10.1109/TMM.2024.3371220
CLC number
TP [automation technology, computer technology]
Discipline classification code
0812
Abstract
Video recognition has recently benefited from multi-modal learning, which integrates distinct modalities to improve the performance or robustness of a model. Although various multi-modal learning methods achieve remarkable recognition results, almost all of them rely on high-quality manual annotations and assume that the modalities within each sample provide semantically consistent information. Unfortunately, widely used video datasets are usually coarsely annotated or collected from the Internet, so they inevitably contain a portion of noisy labels and noisy correspondence. To address this challenge, we use the audio-visual action recognition task as a proxy and propose a noise-tolerant learning framework that finds model parameters robust to both noisy labels and noisy correspondence. Specifically, our method consists of two phases that rectify noise through the inherent correlation between modalities. First, a noise-tolerant contrastive training phase makes the model immune to possible noisy-labeled data. Despite its benefits, contrastive training can overfit noisy correspondence and thus provide false supervision. To alleviate this influence, we propose a cross-modal noise estimation component that adjusts the consistency between different modalities. Because noisy correspondence exists at the instance level, we further propose a category-level contrastive loss to reduce its interference. Second, in the hybrid-supervised training phase, we compute distances among features to obtain corrected labels, which serve as complementary supervision to guide training. Furthermore, due to the lack of suitable datasets, we establish a benchmark of real-world noisy correspondence in audio-visual data by relabeling the Kinetics dataset. Extensive experiments across a wide range of noise levels demonstrate that our method significantly improves the robustness of the action recognition model and surpasses the baselines by a clear margin.
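
Since the record contains no code, the following is a minimal PyTorch sketch of three components described in the abstract: a cross-modal noise-estimation weight that down-weights suspected noisy audio-visual pairs, a category-level contrastive loss computed between class prototypes rather than instances, and the phase-two label correction based on feature distances. All function names, temperatures, thresholds, and the weighting heuristic are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def cross_modal_noise_weights(audio, visual, temperature=0.5):
    # Stand-in for the paper's cross-modal noise estimation: pairs whose
    # audio and visual embeddings disagree (low cosine similarity relative
    # to the batch) are treated as likely noisy correspondence and get a
    # smaller loss weight. The softmax weighting scheme is an assumption.
    audio = F.normalize(audio, dim=1)
    visual = F.normalize(visual, dim=1)
    sim = (audio * visual).sum(dim=1)                  # per-pair cosine
    return torch.softmax(sim / temperature, dim=0) * sim.numel()

def category_contrastive_loss(audio, visual, labels, temperature=0.1):
    # Contrast class prototypes across modalities instead of instances.
    # Averaging within a class suppresses instance-level correspondence
    # noise: a few mismatched pairs barely move the prototype.
    audio = F.normalize(audio, dim=1)
    visual = F.normalize(visual, dim=1)
    classes = labels.unique()
    a_proto = F.normalize(torch.stack(
        [audio[labels == c].mean(0) for c in classes]), dim=1)
    v_proto = F.normalize(torch.stack(
        [visual[labels == c].mean(0) for c in classes]), dim=1)
    logits = a_proto @ v_proto.t() / temperature
    targets = torch.arange(len(classes))
    # Symmetric InfoNCE over matching prototype pairs.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def correct_labels(features, labels, threshold=0.7):
    # Phase-two sketch: a sample whose feature is confidently closest to
    # another class's prototype is relabeled, providing the complementary
    # supervision mentioned in the abstract. The threshold is assumed.
    features = F.normalize(features, dim=1)
    classes = labels.unique()
    protos = F.normalize(torch.stack(
        [features[labels == c].mean(0) for c in classes]), dim=1)
    conf, idx = (features @ protos.t()).max(dim=1)
    return torch.where(conf > threshold, classes[idx], labels)

# Usage on random features standing in for encoder outputs.
audio_feat, visual_feat = torch.randn(32, 128), torch.randn(32, 128)
labels = torch.randint(0, 8, (32,))
w = cross_modal_noise_weights(audio_feat, visual_feat)
loss = category_contrastive_loss(audio_feat, visual_feat, labels)
new_labels = correct_labels(visual_feat, labels)

Averaging inside each class is what makes a category-level loss tolerant of instance-level mismatches: a handful of wrong audio-visual pairs shifts a class prototype far less than it shifts an individual contrastive anchor.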
Pages: 7761-7774
Number of pages: 14