Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition

Cited by: 2
Authors
Assefa, Maregu [1]
Jiang, Wei [1]
Zhan, Jinyu [1]
Gedamu, Kumie [2, 3]
Yilma, Getinet [4]
Ayalew, Melese [1]
Adhikari, Deepak [1]
Affiliations
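[1] University of Electronic Science and Technology of China, School of Information and Software Engineering, Chengdu 610054, People's Republic of China
[2] Sichuan Artificial Intelligence Research Institute, Chengdu, People's Republic of China
[3] University of Electronic Science and Technology of China, School of Computer Science and Engineering, Chengdu 610056, People's Republic of China
[4] Adama Science and Technology University, Department of Computer Science and Engineering, Adama 1888, Ethiopia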
Funding
National Natural Science Foundation of China
Keywords
Action recognition; audio-visual learning; contrastive learning; semi-supervised learning
DOI
10.1109/TMM.2023.3312856
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Semi-supervised video learning is an increasingly popular approach for improving video understanding tasks by utilizing large-scale unlabeled videos along with a few labels. Recent studies have shown that multimodal contrastive learning and consistency regularization are effective techniques for generating high-quality pseudo-labels for semi-supervised action recognition. However, existing pseudo-labeling approaches are solely based on the model's class predictions and can suffer from confirmation biases due to the accumulation of false predictions. To address this issue, we propose exploiting audio-visual feature correlations to achieve high-quality pseudo-labels instead of relying on model confidence. To achieve this goal, we introduce Audio-visual Contrastive and Consistency Learning (AvCLR) for semi-supervised action recognition. AvCLR generates reliable pseudo-labels from audio-visual feature correlations using deep embedded clustering to mitigate confirmation biases. Additionally, AvCLR introduces two contrastive modules: intra-modal contrastive learning (ImCL) and cross-modal contrastive learning (XmCL) to discover complementary information from audio-visual alignments. The ImCL module learns informative representations within audio and video independently, while the XmCL module aims to leverage global high-level features of audio-visual information. Furthermore, the XmCL is constrained by introducing intra-instance negatives from one modality to the other. We jointly optimize the model with ImCL, XmCL, and consistency regularization in an end-to-end semi-supervised manner. Experimental results have demonstrated that the proposed AvCLR framework is effective in reducing confirmation biases and outperforms existing confidence-based semi-supervised action recognition methods.
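The cross-modal contrastive idea in the abstract can be illustrated with a generic symmetric InfoNCE loss between paired video and audio embeddings. This is only a minimal sketch, not the paper's XmCL objective: the function names, the temperature value, and the use of plain NumPy are assumptions made for illustration, and the intra-instance negatives and clustering-based pseudo-labels described in the abstract are not modeled here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_modal_info_nce(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE loss (hypothetical helper, not from the paper).

    Row i of video_emb and row i of audio_emb are assumed to come from the
    same clip (positive pair); all other rows in the batch act as negatives.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits)[idx, idx].mean()    # video -> audio
    loss_a2v = -log_softmax(logits.T)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    audio = video + 0.01 * rng.normal(size=(8, 16))   # well-aligned pairs
    print("aligned:", cross_modal_info_nce(video, audio))
    print("shuffled:", cross_modal_info_nce(video, np.roll(audio, 1, axis=0)))
```

In this toy setup, correctly paired embeddings should yield a lower loss than the same embeddings with the audio rows shifted out of alignment, which is the signal a contrastive module uses to pull matching audio-visual pairs together.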
Pages: 3491-3504
Number of pages: 14