Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics

Cited by: 0
Authors
Wang, Jiangliu [1]
Jiao, Jianbo [2]
Bao, Linchao [3]
He, Shengfeng [4]
Liu, Wei [3]
Liu, Yun-hui [1]
Affiliations
[1] Chinese Univ Hong Kong CUHK, CUHK T Stone Robot Inst, Hong Kong Ctr Logist Robot, Hong Kong, Peoples R China
[2] Univ Oxford, Dept Engn Sci, Oxford OX1 2JD, England
[3] Tencent AI Lab, Shenzhen 518057, Guangdong, Peoples R China
[4] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou 510641, Guangdong, Peoples R China
Funding
Engineering and Physical Sciences Research Council (UK); National Natural Science Foundation of China
Keywords
Task analysis; Three-dimensional displays; Neural networks; Image color analysis; Visualization; Training; Feature extraction; Self-supervised learning; representation learning; video understanding; 3D CNN; RECOGNITION; FLOW;
DOI
10.1109/TPAMI.2021.3057833
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
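The idea of encoding a rough location instead of exact coordinates can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `largest_motion_label`, the single uniform grid, the number of direction bins, and the use of one precomputed flow field are all assumptions made for illustration, whereas the abstract mentions several partitioning patterns and statistics computed over a whole clip.

```python
import math


def largest_motion_label(flow, grid=4, n_dir_bins=8):
    """Return (block_index, direction_bin) for the largest motion vector.

    flow: H x W nested list of (dx, dy) tuples (hypothetical input format).
    grid: number of cells per side in an assumed uniform grid pattern.
    n_dir_bins: number of equal angular bins used to quantize direction.
    """
    height, width = len(flow), len(flow[0])

    # Find the pixel with the largest motion magnitude.
    best_mag, best_y, best_x = -1.0, 0, 0
    for y, row in enumerate(flow):
        for x, (dx, dy) in enumerate(row):
            mag = math.hypot(dx, dy)
            if mag > best_mag:
                best_mag, best_y, best_x = mag, y, x

    # Rough location: index of the grid cell containing that pixel,
    # numbered row by row, instead of the exact (x, y) coordinates.
    block_row = best_y * grid // height
    block_col = best_x * grid // width
    block_index = block_row * grid + block_col

    # Dominant direction: angle of the vector mapped to one of n_dir_bins.
    # The angle convention (atan2 of dy, dx) is an assumption.
    dx, dy = flow[best_y][best_x]
    angle = math.atan2(dy, dx) % (2 * math.pi)
    direction_bin = int(angle / (2 * math.pi) * n_dir_bins) % n_dir_bins
    return block_index, direction_bin


# Usage: an 8 x 8 field that is static except for one moving pixel.
flow = [[(0.0, 0.0) for _ in range(8)] for _ in range(8)]
flow[6][1] = (-1.0, 2.0)
print(largest_motion_label(flow))  # (12, 2)
```

In such a setup both outputs are class labels, so a network could be trained on them with a classification loss rather than by regressing coordinates, which is one plausible reading of how coarse labels ease the learning difficulty.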
Pages: 3791-3806
Page count: 16
Related papers
50 in total
  • [1] Contrastive Spatio-Temporal Pretext Learning for Self-Supervised Video Representation
    Zhang, Yujia
    Po, Lai-Man
    Xu, Xuyuan
    Liu, Mengyang
    Wang, Yexin
    Ou, Weifeng
    Zhao, Yuzhi
    Yu, Wing-Yin
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3380 - 3389
  • [2] Video Playback Rate Perception for Self-supervised Spatio-Temporal Representation Learning
    Yao, Yuan
    Liu, Chang
    Luo, Dezhao
    Zhou, Yu
    Ye, Qixiang
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 6547 - 6556
  • [3] Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
    Wang, Jiangliu
    Jiao, Jianbo
    Bao, Linchao
    He, Shengfeng
    Liu, Yunhui
    Liu, Wei
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 4001 - 4010
  • [4] Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
    Luo, Dezhao
    Liu, Chang
    Zhou, Yu
    Yang, Dongbao
    Ma, Can
    Ye, Qixiang
    Wang, Weiping
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11701 - 11708
  • [5] SELF-SUPERVISED SPATIO-TEMPORAL REPRESENTATION LEARNING OF SATELLITE IMAGE TIME SERIES
    Dumeur, Iris
    Valero, Silvia
    Inglada, Jordi
    [J]. IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 642 - 645
  • [6] Joint spatio-temporal features constrained self-supervised electrocardiogram representation learning
    Ran, Ao
    Liu, Huafeng
    [J]. BIOMEDICAL ENGINEERING LETTERS, 2024, 14 (02) : 209 - 220
  • [7] Self-Supervised Spatio-Temporal Representation Learning of Satellite Image Time Series
    Dumeur, Iris
    Valero, Silvia
    Inglada, Jordi
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 4350 - 4367
  • [8] Joint spatio-temporal features constrained self-supervised electrocardiogram representation learning
    Ao Ran
    Huafeng Liu
    [J]. Biomedical Engineering Letters, 2024, 14 : 209 - 220
  • [9] Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds
    Huang, Siyuan
    Xie, Yichen
    Zhu, Song-Chun
    Zhu, Yixin
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6515 - 6525
  • [10] Spatio-Temporal Self-Supervised Learning for Traffic Flow Prediction
    Ji, Jiahao
    Wang, Jingyuan
    Huang, Chao
    Wu, Junjie
    Xu, Boren
    Wu, Zhenhe
    Zhang, Junbo
    Zheng, Yu
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 4, 2023, : 4356 - 4364