Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics

Cited by: 0
Authors
Wang, Jiangliu [1]
Jiao, Jianbo [2]
Bao, Linchao [3]
He, Shengfeng [4]
Liu, Wei [3]
Liu, Yun-hui [1]
Affiliations
[1] Chinese Univ Hong Kong CUHK, CUHK T Stone Robot Inst, Hong Kong Ctr Logist Robot, Hong Kong, Peoples R China
[2] Univ Oxford, Dept Engn Sci, Oxford OX1 2JD, England
[3] Tencent AI Lab, Shenzhen 518057, Guangdong, Peoples R China
[4] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou 510641, Guangdong, Peoples R China
Funding
Engineering and Physical Sciences Research Council (UK); National Natural Science Foundation of China
Keywords
Task analysis; Three-dimensional displays; Neural networks; Image color analysis; Visualization; Training; Feature extraction; Self-supervised learning; representation learning; video understanding; 3D CNN; RECOGNITION; FLOW;
DOI
10.1109/TPAMI.2021.3057833
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a novel pretext task for self-supervised video representation learning. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries from the video frames. To ease the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field and needs only an impression of rough spatial locations to understand the visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D, and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at https://github.com/laura-wang/video_repres_sts.
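To make the idea of encoding a rough spatial location and a dominant direction more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single dense flow field given as a nested list of (dx, dy) tuples, a uniform grid as the only partitioning pattern, and a fixed number of direction bins; the function name `largest_motion_label` and the parameters `grid` and `n_bins` are hypothetical and do not come from the paper or its repository.

```python
import math


def largest_motion_label(flow, grid=4, n_bins=8):
    """Return (block_index, direction_bin) for the pixel with the largest motion.

    flow   : H x W nested list of (dx, dy) tuples (assumed input format).
    grid   : the frame is split into grid x grid blocks (assumed pattern).
    n_bins : number of equal-width direction bins over [0, 2*pi).
    """
    height, width = len(flow), len(flow[0])

    # Find the pixel whose flow vector has the largest magnitude.
    best_mag, best_y, best_x = -1.0, 0, 0
    for y, row in enumerate(flow):
        for x, (dx, dy) in enumerate(row):
            mag = math.hypot(dx, dy)
            if mag > best_mag:
                best_mag, best_y, best_x = mag, y, x

    # Encode the location coarsely as a row-major block index
    # instead of exact pixel coordinates.
    block_row = best_y * grid // height
    block_col = best_x * grid // width
    block_index = block_row * grid + block_col

    # Quantize the motion direction at that pixel into one of n_bins bins.
    dx, dy = flow[best_y][best_x]
    angle = math.atan2(dy, dx) % (2 * math.pi)
    direction_bin = int(angle / (2 * math.pi / n_bins)) % n_bins

    return block_index, direction_bin


# Usage: an 8 x 8 field that is static except for one moving pixel.
demo_flow = [[(0.0, 0.0) for _ in range(8)] for _ in range(8)]
demo_flow[6][1] = (0.0, 3.0)
label = largest_motion_label(demo_flow)
```

A network trained on such targets would solve a classification problem over block indices and direction bins, which is one plausible reading of why coarse labels are easier to learn than exact coordinates; the actual statistics and patterns used in the paper may differ.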
Pages: 3791-3806
Page count: 16
Related papers
50 in total
  • [21] Implicitly using Human Skeleton in Self-supervised Learning: Influence on Spatio-temporal Puzzle Solving and on Video Action Recognition
    Riand, Mathieu
    Dolle, Laurent
    Le Callet, Patrick
    [J]. PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON ROBOTICS, COMPUTER VISION AND INTELLIGENT SYSTEMS (ROBOVIS), 2021, : 128 - 135
  • [22] Video Face Clustering with Self-Supervised Representation Learning
    Sharma, Vivek
    Tapaswi, Makarand
    Saquib Sarfraz, M.
    Stiefelhagen, Rainer
    [J]. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2020, 2 (02): : 145 - 157
  • [23] Self-Supervised Representation Learning for Video Quality Assessment
    Jiang, Shaojie
    Sang, Qingbing
    Hu, Zongyao
    Liu, Lixiong
    [J]. IEEE TRANSACTIONS ON BROADCASTING, 2023, 69 (01) : 118 - 129
  • [24] Video Motion Perception for Self-supervised Representation Learning
    Li, Wei
    Luo, Dezhao
    Fang, Bo
    Li, Xiaoni
    Zhou, Yu
    Wang, Weiping
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT IV, 2022, 13532 : 508 - 520
  • [25] Spatio-Temporal Crop Aggregation for Video Representation Learning
    Sameni, Sepehr
    Jenni, Simon
    Favaro, Paolo
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 5641 - 5651
  • [26] Video representation learning by identifying spatio-temporal transformations
    Sheng Geng
    Shimin Zhao
    Hu Liu
    [J]. Applied Intelligence, 2022, 52 : 6613 - 6622
  • [27] Video representation learning by identifying spatio-temporal transformations
    Geng, Sheng
    Zhao, Shimin
    Liu, Hu
    [J]. APPLIED INTELLIGENCE, 2022, 52 (06) : 6613 - 6622
  • [28] Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning
    Zhang, Zehua
    Crandall, David
    [J]. 2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 975 - 985
  • [29] CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing
    Wei, Rukai
    Liu, Yu
    Song, Jingkuan
    Cui, Heng
    Xie, Yanzhao
    Zhou, Ke
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 1677 - 1688
  • [30] Static and Dynamic Concepts for Self-supervised Video Representation Learning
    Qian, Rui
    Ding, Shuangrui
    Liu, Xian
    Lin, Dahua
    [J]. COMPUTER VISION, ECCV 2022, PT XXVI, 2022, 13686 : 145 - 164