Attentive spatial-temporal contrastive learning for self-supervised video representation

Cited by: 3

Authors
Yang, Xingming [1 ,2 ]
Xiong, Sixuan [2 ]
Wu, Kewei [1 ,2 ]
Shan, Dongfeng [2 ]
Xie, Zhao [1 ,2 ]
Affiliations
[1] Hefei Univ Technol, Key Lab Knowledge Engn Big Data, Minist Educ, Hefei 230601, Peoples R China
[2] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Peoples R China
Funding
Natural Science Foundation of Anhui Province;
Keywords
Self-supervised learning; Spatial-temporal feature; Contrastive learning; Spatial-temporal self-attention;
DOI
10.1016/j.imavis.2023.104765
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Most existing self-supervised works learn video representation using a single pretext task. A single pretext task, providing only one form of supervision from unlabeled data, may fail to capture the difference between spatial features and temporal features. Similar spatial and temporal features may hinder distinguishing between two similar videos with different class labels. In this paper, we propose an attentive spatial-temporal contrastive learning network (ASTCNet), which learns self-attention spatial-temporal features by contrastive learning between multiple spatial and temporal pretext tasks. The spatial features are learned with multiple spatial pretext tasks, including spatial rotation and spatial jigsaw. Each spatial feature is enhanced with spatial self-attention by learning the relations between patches. The temporal features are learned with multiple temporal pretext tasks, including temporal order and temporal pace. Each temporal feature is enhanced with temporal self-attention by learning the relations between frames, and is further enhanced by feeding optical flow features into a motion module. To separate the spatial and temporal features learned in one video, we represent the video with a distinct feature for each pretext task and design a pretext task-based contrastive loss. This loss encourages different pretext tasks to learn dissimilar features and the same pretext task to learn similar features, yielding discriminative features for each pretext task in one video. Experiments show that our method achieves state-of-the-art performance for self-supervised action recognition on the UCF101 and HMDB51 datasets.
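The pretext task-based contrastive loss described in the abstract can be read as an InfoNCE-style objective in which features produced by the same pretext task form positive pairs and features from different pretext tasks act as negatives. The following is a minimal NumPy sketch under that reading; the function name, temperature value, and batch layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pretext_contrastive_loss(features, task_ids, temperature=0.1):
    """InfoNCE-style contrastive loss over pretext-task features.

    Features from the same pretext task are treated as positive pairs
    (pulled together); features from different pretext tasks serve as
    negatives (pushed apart).

    features : (n, d) array, one row per pretext-task feature of a video.
    task_ids : length-n sequence, the pretext task that produced each row.
    """
    # Cosine similarities, scaled by temperature.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = (f @ f.T) / temperature

    n = len(task_ids)
    total, count = 0.0, 0
    for i in range(n):
        # All candidates except the anchor itself.
        logits = np.delete(sim[i], i)
        log_denom = np.log(np.exp(logits).sum())
        for j in range(n):
            if i != j and task_ids[i] == task_ids[j]:
                idx = j if j < i else j - 1  # j's index after removing i
                total += -(logits[idx] - log_denom)  # -log softmax(positive)
                count += 1
    return total / max(count, 1)
```

When same-task features cluster together and cross-task features are dissimilar, each positive dominates its softmax and the loss is small; if the opposite holds, the loss grows, which is exactly the separation the abstract describes.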
Pages: 10
Related Papers
50 items total
  • [1] Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning
    Zhang, Zehua
    Crandall, David
    [J]. 2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 975 - 985
  • [2] TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning
    Liu, Yang
    Wang, Keze
    Liu, Lingbo
    Lan, Haoyuan
    Lin, Liang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1978 - 1993
  • [3] Cross-View Temporal Contrastive Learning for Self-Supervised Video Representation
    Wang, Lulu
    Xu, Zengmin
    Zhang, Xuelian
    Meng, Ruxing
    Lu, Tao
    [J]. Computer Engineering and Applications, 60 (18): 158 - 166
  • [4] Contrastive Spatio-Temporal Pretext Learning for Self-Supervised Video Representation
    Zhang, Yujia
    Po, Lai-Man
    Xu, Xuyuan
    Liu, Mengyang
    Wang, Yexin
    Ou, Weifeng
    Zhao, Yuzhi
    Yu, Wing-Yin
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3380 - 3389
  • [5] Motion Sensitive Contrastive Learning for Self-supervised Video Representation
    Ni, Jingcheng
    Zhou, Nan
    Qin, Jie
    Wu, Qian
    Liu, Junqi
    Li, Boxun
    Huang, Di
    [J]. COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 457 - 474
  • [6] Self-Supervised Representation Learning With Spatial-Temporal Consistency for Sign Language Recognition
    Zhao, Weichao
    Zhou, Wengang
    Hu, Hezhen
    Wang, Min
    Li, Houqiang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 4188 - 4201
  • [7] Self-Supervised Video Representation Learning with Meta-Contrastive Network
    Lin, Yuanze
    Guo, Xun
    Lu, Yan
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 8219 - 8229
  • [8] Spatial-Temporal Hypergraph Self-Supervised Learning for Crime Prediction
    Li, Zhonghang
    Huang, Chao
    Xia, Lianghao
    Xu, Yong
    Pei, Jian
    [J]. 2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 2984 - 2996
  • [9] Cut-in maneuver detection with self-supervised contrastive video representation learning
    Nalcakan, Yagiz
    Bastanlar, Yalin
    [J]. SIGNAL IMAGE AND VIDEO PROCESSING, 2023, 17 (06) : 2915 - 2923