Masked Motion Encoding for Self-Supervised Video Representation Learning

Cited by: 2
Authors
Sun, Xinyu [1 ,2 ]
Chen, Peihao [1 ]
Chen, Liangwei [1 ]
Li, Changhao [1 ]
Li, Thomas H. [6 ]
Tan, Mingkui [1 ,5 ,7 ]
Gan, Chuang [3 ,4 ]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] Peking Univ, Informat Technol R&D Innovat Ctr, Beijing, Peoples R China
[3] UMass Amherst, Amherst, MA USA
[4] MIT IBM Watson AI Lab, Cambridge, MA USA
[5] Minist Educ, Key Lab Big Data & Intelligent Robot, Beijing, Peoples R China
[6] Peking Univ, Shenzhen Grad Sch, Beijing, Peoples R China
[7] Pazhou Lab, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52729.2023.00222
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
How to learn discriminative video representations from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues, as the appearance contents can easily be reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. In MME, we focus on addressing two critical challenges to improve representation performance: 1) how to represent possible long-term motion across multiple frames well; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. In addition, given the sparse video input, we require the model to reconstruct dense motion trajectories in both the spatial and temporal dimensions. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME.
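The abstract's core idea can be sketched in a few lines: mask a fixed set of spatial patches across all frames (a "tube" mask), and train the model to regress, for each masked patch, its frame-to-frame position changes and shape changes rather than its pixels. The following is a minimal illustrative sketch of that target construction and loss, not the authors' released implementation; the function names (`tube_mask`, `trajectory_target`, `mme_loss`) and the 2-D position/shape parameterization are assumptions chosen for brevity.

```python
import numpy as np

def tube_mask(num_patches, num_frames, mask_ratio, rng):
    """Sample a tube mask: the same spatial patches are hidden in every frame."""
    num_masked = int(num_patches * mask_ratio)
    idx = rng.permutation(num_patches)[:num_masked]
    mask = np.zeros((num_frames, num_patches), dtype=bool)
    mask[:, idx] = True
    return mask

def trajectory_target(positions, shapes, mask):
    """Regression target for masked patches: per-frame position changes and
    shape changes, concatenated along the last axis.

    positions, shapes: (T, P, 2) arrays of patch centers / patch extents.
    Returns an (N, 4) array, one row per masked patch per transition.
    """
    dpos = np.diff(positions, axis=0)    # (T-1, P, 2) position changes
    dshape = np.diff(shapes, axis=0)     # (T-1, P, 2) shape changes
    target = np.concatenate([dpos, dshape], axis=-1)  # (T-1, P, 4)
    return target[mask[1:]]              # keep masked patches only

def mme_loss(pred, target):
    """Mean-squared reconstruction error on the masked trajectories."""
    return float(np.mean((pred - target) ** 2))

# Toy usage: 8 sparsely sampled frames, 196 patches, 75% tube masking.
rng = np.random.default_rng(0)
T, P = 8, 196
positions = rng.normal(size=(T, P, 2))
shapes = rng.normal(size=(T, P, 2))
mask = tube_mask(P, T, 0.75, rng)
target = trajectory_target(positions, shapes, mask)
```

In the paper's setting the targets would come from dense frames (hence "dense motion trajectories from sparse input"), and the predictor is a masked video transformer; here the arrays are random placeholders purely to show the shapes involved.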
Pages: 2235 - 2245
Page count: 11
Related Papers
50 items in total
  • [21] Learning disentangled representation for self-supervised video object segmentation
    Hou, Wenjie
    Qin, Zheyun
    Xi, Xiaoming
    Lu, Xiankai
    Yin, Yilong
    NEUROCOMPUTING, 2022, 481 : 270 - 280
  • [22] Mitigating background bias in self-supervised video representation learning
    Akar, Arif
    Senturk, Ufuk Umut
    Ikizler-Cinbis, Nazli
    Signal, Image and Video Processing, 2025, 19 (1)
  • [23] Self-supervised Co-training for Video Representation Learning
    Han, Tengda
    Xie, Weidi
    Zisserman, Andrew
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [24] ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints
    Das, Srijan
    Ryoo, Michael S.
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 5562 - 5572
  • [25] Temporally coherent embeddings for self-supervised video representation learning
    CSIRO-Data61, Brisbane, QLD, Australia
    arXiv
  • [26] Motion-guided spatiotemporal multitask feature discrimination for self-supervised video representation learning
    Bi, Shuai
    Hu, Zhengping
    Zhang, Hehao
    Di, Jirui
    Sun, Zhe
    PATTERN RECOGNITION, 2024, 155
  • [27] SELF-SUPERVISED REPRESENTATION LEARNING FOR MOTION CONTROL OF AUTONOMOUS VEHICLES
    Ayalew, Melese
    Zhou, Shijie
    Assefa, Maregu
    Gedamu, Kumie
    Yilma, Getinet
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
  • [28] Cross-View Masked Model for Self-Supervised Graph Representation Learning
    Duan, H.
    Yu, B.
    Xie, C.
    IEEE Transactions on Artificial Intelligence, 2024, 5 (11): 1 - 13
  • [29] Masked self-supervised ECG representation learning via multiview information bottleneck
    Yang, Shunxiang
    Lian, Cheng
    Zeng, Zhigang
    Xu, Bingrong
    Su, Yixin
    Xue, Chenyang
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (14): 7625 - 7637
  • [30] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
    Hsu, Wei-Ning
    Bolte, Benjamin
    Tsai, Yao-Hung Hubert
    Lakhotia, Kushal
    Salakhutdinov, Ruslan
    Mohamed, Abdelrahman
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3451 - 3460