Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

Cited by: 6
Authors
Chen, Minghao [1 ]
Wei, Fangyun [2 ]
Li, Chong [2 ]
Cai, Deng [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, State Key Lab CAD&CG, Hangzhou, Peoples R China
[2] Microsoft Res Asia, Beijing, Peoples R China
DOI: 10.1109/CVPR52688.2022.01343
Chinese Library Classification: TP18 [Theory of Artificial Intelligence]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Prior work on action representation learning mainly focuses on designing various architectures to extract global representations for short video clips. In contrast, many practical applications, such as video alignment, have a strong demand for learning dense representations for long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner. Concretely, we introduce a simple yet efficient video encoder that considers spatio-temporal context to extract frame-wise representations. Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views obtained through a series of spatio-temporal data augmentations. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of the two augmented views and a prior Gaussian distribution of timestamp distance. Experiments on the FineGym, PennAction, and Pouring datasets show that our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification. Surprisingly, although it is not trained on paired videos, our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks. Code and models are available at https://github.com/minghchen/CARL_code.
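As a reading aid, the loss described in the abstract can be sketched as follows. This is a minimal formulation reconstructed from the abstract alone; the symbols σ (Gaussian width), τ (temperature), the normalization, and the direction of the KL term are assumptions, not the authors' definitive loss. For frames i of one augmented view (timestamps t_i) and frames j of the other view (timestamps s_j), with frame embeddings z_i and z'_j:

% Prior Gaussian distribution over timestamp distance (width \sigma assumed)
g_{ij} = \frac{\exp\!\left(-(t_i - s_j)^2 / 2\sigma^2\right)}{\sum_k \exp\!\left(-(t_i - s_k)^2 / 2\sigma^2\right)}

% Sequence similarity between the two augmented views (temperature \tau assumed)
p_{ij} = \frac{\exp\!\left(z_i \cdot z'_j / \tau\right)}{\sum_k \exp\!\left(z_i \cdot z'_k / \tau\right)}

% Sequence contrastive loss: match the similarity distribution to the Gaussian prior
\mathcal{L}_{\mathrm{SCL}} = \frac{1}{N}\sum_i \mathrm{KL}\!\left(g_i \,\|\, p_i\right), \quad \text{symmetrized over the two views}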
Pages: 13791-13800 (10 pages)
Related Articles (50 total; 10 listed below)
  • [1] Lee, Joo Chan; Rho, Daniel; Ko, Jong Hwan; Park, Eunbyung. FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos. Proceedings of the 31st ACM International Conference on Multimedia (MM 2023), 2023: 7859-7870.
  • [2] Wang, Zhihao; Li, Lin; Xie, Zhongwei; Liu, Chuanbo. Video Frame-wise Explanation Driven Contrastive Learning for Procedural Text Generation. Computer Vision and Image Understanding, 2024, 241.
  • [3] Huang, Hongyue; Schiopu, Ionut; Munteanu, Adrian. Frame-Wise CNN-Based Filtering for Intra-Frame Quality Enhancement of HEVC Videos. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(6): 2100-2113.
  • [4] Guan, Yadong; Han, Jiqing; Song, Hongwei; Song, Wenjie; Zheng, Guibin; Zheng, Tieran; He, Yongjun. Contrastive Loss Based Frame-Wise Feature Disentanglement for Polyphonic Sound Event Detection. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), 2024: 1021-1025.
  • [5] He, Peisong; Li, Haoliang; Wang, Hongxia; Wang, Shiqi; Jiang, Xinghao; Zhang, Ruimei. Frame-Wise Detection of Double HEVC Compression by Learning Deep Spatio-Temporal Representations in Compression Domain. IEEE Transactions on Multimedia, 2021, 23: 3179-3192.
  • [6] Yan, Shaoqi; Wang, Yan; Mai, Xinji; Zhao, Qing; Song, Wei; Huang, Jun; Tao, Zeng; Wang, Haoran; Gao, Shuyong; Zhang, Wenqiang. Empower Smart Cities with Sampling-Wise Dynamic Facial Expression Recognition via Frame-Sequence Contrastive Learning. Computer Communications, 2024, 216: 130-139.
  • [7] Xiao, Fanyi; Tighe, Joseph; Modolo, Davide. MaCLR: Motion-Aware Contrastive Learning of Representations for Videos. Computer Vision - ECCV 2022, Part XXXV, 2022, 13695: 353-370.
  • [8] Banerjee, Tirthankar; Thurlapati, Narasimha Rao; Pavithra, V.; Mahalakshmi, S.; Eledath, Dhanya; Ramasubramanian, V. Few-Shot Learning for Frame-Wise Phoneme Recognition: Adaptation of Matching Networks. 29th European Signal Processing Conference (EUSIPCO 2021), 2021: 516-520.
  • [9] Tani, Hiroaki; Shibata, Tomoyuki. Frame-Wise Action Recognition Training Framework for Skeleton-Based Anomaly Behavior Detection. Image Analysis and Processing (ICIAP 2022), Part III, 2022, 13233: 312-323.
  • [10] Sun, Chuan; Tappen, Marshall; Foroosh, Hassan. Feature-Independent Action Spotting Without Human Localization, Segmentation or Frame-wise Tracking. 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014: 2689-2696.