Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

Cited by: 6
Authors
Chen, Minghao [1]
Wei, Fangyun [2]
Li, Chong [2]
Cai, Deng [1]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, State Key Lab CAD&CG, Hangzhou, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
DOI
10.1109/CVPR52688.2022.01343
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Prior works on action representation learning mainly focus on designing various architectures to extract the global representations for short video clips. In contrast, many practical applications such as video alignment have strong demand for learning dense representations for long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner. Concretely, we introduce a simple yet efficient video encoder that considers spatio-temporal context to extract frame-wise representations. Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views obtained through a series of spatio-temporal data augmentations. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of two augmented views and a prior Gaussian distribution of timestamp distance. Experiments on FineGym, PennAction and Pouring datasets show that our method outperforms previous state-of-the-art by a large margin for downstream fine-grained action classification. Surprisingly, although without training on paired videos, our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks. Code and models are available at https://github.com/minghchen/CARL_code.
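The abstract describes SCL only in words, so the following is a minimal sketch of one way such a loss could be written. It is not taken from the authors' released code. The function name `sequence_contrastive_loss`, the temperature `tau`, the Gaussian width `sigma`, the use of cosine similarity, the KL direction (prior relative to prediction), and the single-direction formulation (view 1 against view 2 only) are all assumptions made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def sequence_contrastive_loss(z1, z2, t1, t2, sigma=1.0, tau=0.1):
    """Hypothetical SCL sketch (assumed form, not the published one).

    z1, z2: frame embeddings of the two augmented views (lists of vectors).
    t1, t2: original-video timestamps of those frames.
    For each frame in view 1, a softmax over its similarities to all frames
    in view 2 is pulled toward a normalized Gaussian of timestamp distance.
    """
    z1 = [l2_normalize(v) for v in z1]
    z2 = [l2_normalize(v) for v in z2]
    total = 0.0
    for zi, ti in zip(z1, t1):
        # Predicted distribution: softmax of temperature-scaled similarities.
        sims = [sum(a * b for a, b in zip(zi, zj)) / tau for zj in z2]
        log_pred = log_softmax(sims)
        # Prior distribution: Gaussian in timestamp distance, normalized.
        log_prior = log_softmax(
            [-((ti - tj) ** 2) / (2 * sigma ** 2) for tj in t2]
        )
        # KL(prior || predicted) for this frame.
        total += sum(
            math.exp(lp) * (lp - lq) for lp, lq in zip(log_prior, log_pred)
        )
    return total / len(z1)


# Toy check: embeddings that rotate smoothly with time should score a lower
# loss when the second view is time-aligned than when it is reversed.
t = [float(i) for i in range(8)]
z = [[math.cos(0.3 * ti), math.sin(0.3 * ti)] for ti in t]
loss_aligned = sequence_contrastive_loss(z, z, t, t)
loss_reversed = sequence_contrastive_loss(z, z[::-1], t, t)
print(loss_aligned < loss_reversed)
```

Under these assumptions, frames whose timestamps are close are encouraged to have similar embeddings, and the encouragement decays smoothly with temporal distance rather than treating only the exact same frame as a positive.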
Pages: 13791-13800
Number of pages: 10