Masked Motion Encoding for Self-Supervised Video Representation Learning

Cited by: 2

Authors:

Sun, Xinyu ^{[1,2]}

Chen, Peihao ^{[1]}

Chen, Liangwei ^{[1]}

Li, Changhao ^{[1]}

Li, Thomas H. ^{[6]}

Tan, Mingkui ^{[1,5,7]}

Gan, Chuang ^{[3,4]}

Affiliations:

[1] South China Univ Technol, Guangzhou, Peoples R China

[2] Peking Univ, Informat Technol R&D Innovat Ctr, Beijing, Peoples R China

[3] UMass Amherst, Amherst, MA USA

[4] MIT IBM Watson AI Lab, Cambridge, MA USA

[5] Minist Educ, Key Lab Big Data & Intelligent Robot, Beijing, Peoples R China

[6] Peking Univ, Shenzhen Grad Sch, Beijing, Peoples R China

[7] Pazhou Lab, Guangzhou, Peoples R China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

10.1109/CVPR52729.2023.00222

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning discriminative video representations from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues, as the appearance contents can be easily reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. In MME, we focus on addressing two critical challenges to improve the representation performance: 1) how to represent the possible long-term motion across multiple frames well; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. In addition, given the sparse video input, we require the model to reconstruct dense motion trajectories in both spatial and temporal dimensions. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME.
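The abstract does not say how the motion trajectories are computed, so the following is only a rough sketch of the general idea of a position-change trajectory, not the authors' implementation. It assumes that dense per-frame displacement fields (for example, from an optical-flow estimator) are already available; the function names `track_points` and `position_change_target` are hypothetical, and shape changes are not modeled here.

```python
import numpy as np


def track_points(flows, start_points):
    """Follow points through a sequence of dense displacement fields.

    Assumption (not from the abstract): flows has shape (T, H, W, 2) and
    flows[t, y, x] holds the (dx, dy) displacement from frame t to t + 1.
    start_points has shape (N, 2) with (x, y) positions in frame 0.
    Returns positions of shape (T + 1, N, 2).
    """
    T, H, W, _ = flows.shape
    positions = np.empty((T + 1, len(start_points), 2), dtype=np.float64)
    positions[0] = start_points
    for t in range(T):
        # Nearest-neighbour lookup of the displacement at each current position,
        # clipped so that points leaving the frame reuse the border value.
        xs = np.clip(np.rint(positions[t, :, 0]).astype(int), 0, W - 1)
        ys = np.clip(np.rint(positions[t, :, 1]).astype(int), 0, H - 1)
        positions[t + 1] = positions[t] + flows[t, ys, xs]
    return positions


def position_change_target(positions):
    """Per-step displacements along each track, shape (N, T, 2)."""
    return np.diff(positions, axis=0).transpose(1, 0, 2)


# Toy example: every pixel moves 1 px right and 0.5 px down per frame.
flows = np.zeros((4, 8, 8, 2))
flows[..., 0] = 1.0
flows[..., 1] = 0.5
starts = np.array([[1.0, 1.0], [2.0, 3.0]])

positions = track_points(flows, starts)
target = position_change_target(positions)

# Sparse input versus dense target: a model would only see every 2nd frame,
# while the regression target keeps all 4 per-step displacements.
sparse_frame_ids = np.arange(0, flows.shape[0] + 1, 2)
```

In a masked pre-training setup, a target like `target` would be regressed only for the masked regions, while the encoder sees the frames listed in `sparse_frame_ids`. This pairing of sparse input with a dense target is one reading of the abstract's second challenge.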


Pages: 2235-2245

Number of pages: 11
