Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition

Cited by: 2
Authors
Assefa, Maregu [1]
Jiang, Wei [1]
Zhan, Jinyu [1]
Gedamu, Kumie [2, 3]
Yilma, Getinet [4]
Ayalew, Melese [1]
Adhikari, Deepak [1]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Informat & Software Engn, Chengdu 610054, Peoples R China
[2] Sichuan Artificial Intelligence Res Inst, Chengdu, Peoples R China
[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610056, Peoples R China
[4] Adama Sci & Technol Univ, Dept Comp Sci & Engn, Adama 1888, Ethiopia
Funding
National Natural Science Foundation of China
Keywords
Action recognition; audio-visual learning; contrastive learning; semi-supervised learning;
DOI
10.1109/TMM.2023.3312856
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Semi-supervised video learning is an increasingly popular approach for improving video understanding tasks by utilizing large-scale unlabeled videos along with a few labels. Recent studies have shown that multimodal contrastive learning and consistency regularization are effective techniques for generating high-quality pseudo-labels for semi-supervised action recognition. However, existing pseudo-labeling approaches are solely based on the model's class predictions and can suffer from confirmation biases due to the accumulation of false predictions. To address this issue, we propose exploiting audio-visual feature correlations to achieve high-quality pseudo-labels instead of relying on model confidence. To achieve this goal, we introduce Audio-visual Contrastive and Consistency Learning (AvCLR) for semi-supervised action recognition. AvCLR generates reliable pseudo-labels from audio-visual feature correlations using deep embedded clustering to mitigate confirmation biases. Additionally, AvCLR introduces two contrastive modules: intra-modal contrastive learning (ImCL) and cross-modal contrastive learning (XmCL) to discover complementary information from audio-visual alignments. The ImCL module learns informative representations within audio and video independently, while the XmCL module aims to leverage global high-level features of audio-visual information. Furthermore, the XmCL is constrained by introducing intra-instance negatives from one modality to the other. We jointly optimize the model with ImCL, XmCL, and consistency regularization in an end-to-end semi-supervised manner. Experimental results have demonstrated that the proposed AvCLR framework is effective in reducing confirmation biases and outperforms existing confidence-based semi-supervised action recognition methods.
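The cross-modal contrastive idea described in the abstract can be illustrated with a minimal sketch. It is a generic symmetric InfoNCE loss over paired video and audio embeddings, not the AvCLR objective itself: the function names, the temperature value, and the use of other clips in the batch as the only negatives are assumptions made for illustration, and the intra-instance negatives and clustering-based pseudo-labels mentioned in the abstract are not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_info_nce(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE loss (hypothetical helper, not from the paper).

    Row i of video_emb and row i of audio_emb are assumed to come from the
    same clip (the positive pair); every other row in the batch serves as a
    negative.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits)[idx, idx].mean()    # video -> audio
    loss_a2v = -log_softmax(logits.T)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)
```

In this sketch, embeddings whose rows are correctly paired across modalities yield a lower loss than the same embeddings with the pairing shuffled, which is the signal a cross-modal contrastive module trains on.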
Pages: 3491-3504
Number of pages: 14