Noise-Tolerant Learning for Audio-Visual Action Recognition

Cited by: 0

Authors:

Han, Haochen [1,3]

Zheng, Qinghua [1,3]

Luo, Minnan [1,3]

Miao, Kaiyao [2,3,4]

Tian, Feng [1,3]

Chen, Yan [1,3]

Affiliations:

[1] Xi An Jiao Tong Univ, Natl Engn Lab Big Data Analyt, Xian 710049, Peoples R China

[2] Xi An Jiao Tong Univ, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Peoples R China

[3] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xian 710049, Peoples R China

[4] Xi An Jiao Tong Univ, Sch Cyber Sci & Engn, Xian 710049, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Keywords:

Action recognition; audio-visual learning; noisy labels; noisy correspondence; NETWORKS;

DOI:

10.1109/TMM.2024.3371220

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Recently, video recognition has advanced with the help of multi-modal learning, which integrates distinct modalities to improve the performance or robustness of a model. Although various multi-modal learning methods have been proposed and achieve remarkable recognition results, almost all of them rely on high-quality manual annotations and assume that the modalities in multi-modal data provide semantically relevant information. Unfortunately, widely used video datasets are usually coarsely annotated or collected from the Internet, so they inevitably contain a portion of noisy labels and noisy correspondence. To address this challenge, we use the audio-visual action recognition task as a proxy and propose a noise-tolerant learning framework that finds model parameters robust to both noisy labels and noisy correspondence. Specifically, our method consists of two phases that aim to rectify noise using the inherent correlation between modalities. First, a noise-tolerant contrastive training phase makes the model immune to possibly noisy-labeled data. Despite its benefits, contrastive training can overfit the noisy correspondence and thus provide false supervision. To alleviate the influence of noisy correspondence, we propose a cross-modal noise estimation component that adjusts the consistency between different modalities. Because noisy correspondence exists at the instance level, we further propose a category-level contrastive loss to reduce its interference. Second, in the hybrid-supervised training phase, we compute a distance metric among features to obtain corrected labels, which serve as complementary supervision to guide training. Furthermore, due to the lack of suitable datasets, we establish a benchmark of real-world noisy correspondence in audio-visual data by relabeling the Kinetics dataset. Extensive experiments over a wide range of noise levels demonstrate that our method significantly improves the robustness of the action recognition model and surpasses the baselines by a clear margin.

Pages: 7761-7774

Page count: 14
