Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition

Cited by: 2

Authors:

Assefa, Maregu [1]

Jiang, Wei [1]

Zhan, Jinyu [1]

Gedamu, Kumie [2, 3]

Yilma, Getinet [4]

Ayalew, Melese [1]

Adhikari, Deepak [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Sch Informat & Software Engn, Chengdu 610054, Peoples R China

[2] Sichuan Artificial Intelligence Res Inst, Chengdu, Peoples R China

[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610056, Peoples R China

[4] Adama Sci & Technol Univ, Dept Comp Sci & Engn, Adama 1888, Ethiopia

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

National Natural Science Foundation of China;

Keywords:

Action recognition; audio-visual learning; contrastive learning; semi-supervised learning;

DOI:

10.1109/TMM.2023.3312856

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812 ;

Abstract:

Semi-supervised video learning is an increasingly popular approach to improving video understanding by combining large-scale unlabeled videos with a small number of labels. Recent studies have shown that multimodal contrastive learning and consistency regularization are effective techniques for generating high-quality pseudo-labels for semi-supervised action recognition. However, existing pseudo-labeling approaches rely solely on the model's class predictions and can suffer from confirmation bias as false predictions accumulate. To address this issue, we propose exploiting audio-visual feature correlations, rather than model confidence, to obtain high-quality pseudo-labels. To this end, we introduce Audio-visual Contrastive and Consistency Learning (AvCLR) for semi-supervised action recognition. AvCLR generates reliable pseudo-labels from audio-visual feature correlations using deep embedded clustering to mitigate confirmation bias. Additionally, AvCLR introduces two contrastive modules, intra-modal contrastive learning (ImCL) and cross-modal contrastive learning (XmCL), to discover complementary information from audio-visual alignments. The ImCL module learns informative representations within audio and video independently, while the XmCL module leverages global high-level features of audio-visual information. Furthermore, XmCL is constrained by introducing intra-instance negatives from one modality to the other. We jointly optimize the model with ImCL, XmCL, and consistency regularization in an end-to-end semi-supervised manner. Experimental results demonstrate that the proposed AvCLR framework effectively reduces confirmation bias and outperforms existing confidence-based semi-supervised action recognition methods.
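
To make the cross-modal contrastive idea in the abstract more concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss over paired video and audio embeddings. It is not the authors' XmCL formulation: the abstract does not give the loss, and this sketch omits the intra-instance negatives it mentions. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_modal_info_nce(video_emb, audio_emb, temperature=0.1):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's XmCL).

    video_emb[i] and audio_emb[i] are assumed to come from the same clip
    (the positive pair); every other pairing in the batch is a negative.
    """
    v = [l2_normalize(x) for x in video_emb]
    a = [l2_normalize(x) for x in audio_emb]
    n = len(v)

    # sim[i][j]: cosine similarity of video i and audio j, divided by temperature.
    sim = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def softmax_cross_entropy(logits, target):
        # Numerically stable -log softmax(logits)[target].
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        return log_sum_exp - logits[target]

    # Video-to-audio direction: each row of sim, positive on the diagonal.
    v2a = sum(softmax_cross_entropy(sim[i], i) for i in range(n)) / n
    # Audio-to-video direction: each column of sim, positive on the diagonal.
    a2v = sum(
        softmax_cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (v2a + a2v)


# Toy usage: aligned pairs should give a lower loss than mismatched pairs.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[1.0, 0.0], [0.0, 1.0]]
mismatched_audio = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = cross_modal_info_nce(video, aligned_audio)
loss_mismatched = cross_modal_info_nce(video, mismatched_audio)
```

In this toy batch the loss is small when each clip's audio and video embeddings point the same way and large when they are swapped, which is the behavior a cross-modal contrastive objective is meant to reward.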


Pages: 3491 - 3504

Number of pages: 14
