Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

Cited by: 6
Authors
Chen, Minghao [1 ]
Wei, Fangyun [2 ]
Li, Chong [2 ]
Cai, Deng [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, State Key Lab CAD&CG, Hangzhou, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
DOI
10.1109/CVPR52688.2022.01343
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Prior works on action representation learning mainly focus on designing various architectures to extract global representations for short video clips. In contrast, many practical applications such as video alignment demand dense representations of long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner. Concretely, we introduce a simple yet efficient video encoder that considers spatio-temporal context to extract frame-wise representations. Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views obtained through a series of spatio-temporal data augmentations. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of the two augmented views and a prior Gaussian distribution of timestamp distance. Experiments on the FineGym, PennAction and Pouring datasets show that our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification. Surprisingly, although trained without paired videos, our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks. Code and models are available at https://github.com/minghchen/CARL_code.
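The sequence contrastive loss described in the abstract can be sketched as follows. This is a minimal NumPy illustration based only on the abstract's description, not the authors' released implementation; the function name and the `tau` (temperature) and `sigma` (Gaussian width) parameters are assumptions.

```python
import numpy as np

def sequence_contrastive_loss(z1, z2, t1, t2, tau=0.1, sigma=1.0):
    """Hypothetical sketch of a sequence contrastive loss (SCL).

    z1, z2 : (N, D) and (M, D) L2-normalized frame embeddings of two
             augmented views of the same video.
    t1, t2 : (N,) and (M,) frame timestamps in the original video.
    tau    : softmax temperature for the similarity distribution.
    sigma  : standard deviation of the Gaussian prior over timestamp distance.
    Returns the mean KL divergence KL(prior || similarity) over frames.
    """
    # Predicted distribution: softmax over scaled cosine similarities.
    sim = z1 @ z2.T / tau                       # (N, M)
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(sim)
    p /= p.sum(axis=1, keepdims=True)

    # Prior distribution: normalized Gaussian of timestamp distance.
    dist2 = (t1[:, None] - t2[None, :]) ** 2    # (N, M)
    g = np.exp(-dist2 / (2 * sigma ** 2))
    g /= g.sum(axis=1, keepdims=True)

    # KL(g || p), averaged over the frames of the first view.
    eps = 1e-12                                 # avoids log(0)
    kl = (g * (np.log(g + eps) - np.log(p + eps))).sum(axis=1)
    return kl.mean()
```

Intuitively, frames of the two views that are close in time are pushed to have similar embeddings, while temporally distant frames are pushed apart, which is what yields frame-wise (rather than clip-level) representations.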
Pages: 13791-13800 (10 pages)
Related Papers (50 total)
  • [31] Learning Discriminative Representations in Videos via Active Embedding Distance Correlation
    Zhao, Qingsong
    Wang, Yi
    He, Yinan
    Qiao, Yu
    Zhao, Cairong
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 56 - 60
  • [32] A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-wise Perspective in Angular Space
    Zhang, Yuhao
    Zhu, Hongji
    Wang, Yongliang
    Xu, Nan
    Li, Xiaobo
    Zhao, BinQiang
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4892 - 4903
  • [33] End-to-end Learning of Action Detection from Frame Glimpses in Videos
    Yeung, Serena
    Russakovsky, Olga
    Mori, Greg
    Li Fei-Fei
    2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 2678 - 2687
  • [34] Continual Nuclei Segmentation via Prototype-Wise Relation Distillation and Contrastive Learning
    Wu, Huisi
    Wang, Zhaoze
    Zhao, Zebin
    Chen, Cheng
    Qin, Jing
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2023, 42 (12) : 3794 - 3804
  • [35] FTAN: Frame-to-frame temporal alignment network with contrastive learning for few-shot action recognition
    Yu, Bin
    Hou, Yonghong
    Guo, Zihui
    Gao, Zhiyi
    Li, Yueyang
    IMAGE AND VISION COMPUTING, 2024, 149
  • [37] Time course of learning sequence representations in action imagery practice
    Dahm, Stephan F.
    Rieger, Martina
    HUMAN MOVEMENT SCIENCE, 2023, 87
  • [38] Boosting Zero-Shot Learning via Contrastive Optimization of Attribute Representations
    Du, Yu
    Shi, Miaojing
    Wei, Fangyun
    Li, Guoqi
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (11) : 16706 - 16719
  • [39] Long Context Question Answering via Supervised Contrastive Learning
    Caciularu, Avi
    Dagan, Ido
    Goldberger, Jacob
    Cohan, Arman
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 2872 - 2879
  • [40] Action-conditioned contrastive learning for 3D human pose and shape estimation in videos
    Song, Inpyo
    Ryu, Moonwook
    Lee, Jangwon
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249