Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1 ]
Hu, Huazhang [1 ]
Lian, Dongze [2 ]
Luo, Weixin [3 ]
Qian, Yicheng [1 ]
Gao, Shenghua [1 ,4 ,5 ]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Funding
National Key Research and Development Program of China;
DOI
10.1109/CVPR52729.2023.00241
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Sequential video understanding, an emerging video understanding task, has attracted considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to encode the text corresponding to each action and to the whole video. To model the correspondence between text and video, we propose a multiple-granularity loss: a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. Because the frame-sentence correspondence is not available, we exploit the fact that video actions occur sequentially in time to generate pseudo frame-sentence correspondences, and we supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
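The two ingredients named in the abstract, a CLIP-style contrastive loss and pseudo frame-sentence labels derived from temporal order, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names, the temperature value of 0.07, and the dynamic-programming alignment are all assumptions chosen for illustration; the abstract states only that sequential order is used to generate pseudo labels, not how.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style loss; row i of video_emb is assumed to pair with row i of text_emb.

    The temperature is a hypothetical default, not a value from the paper.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


def monotonic_pseudo_labels(sim):
    """Assign each frame a sentence index that never decreases over time.

    sim has shape (num_frames, num_sentences). The path starts at sentence 0,
    ends at the last sentence, and moves forward by at most one sentence per
    frame, maximizing the summed similarity. This dynamic program is an
    illustrative stand-in for the paper's pseudo-label generation.
    """
    num_frames, num_sentences = sim.shape
    assert num_frames >= num_sentences, "need at least one frame per sentence"
    score = np.full((num_frames, num_sentences), -np.inf)
    back = np.zeros((num_frames, num_sentences), dtype=int)
    score[0, 0] = sim[0, 0]
    for t in range(1, num_frames):
        for k in range(min(t + 1, num_sentences)):
            stay = score[t - 1, k]
            advance = score[t - 1, k - 1] if k > 0 else -np.inf
            if advance > stay:
                score[t, k] = advance + sim[t, k]
                back[t, k] = k - 1
            else:
                score[t, k] = stay + sim[t, k]
                back[t, k] = k
    labels = np.empty(num_frames, dtype=int)
    labels[-1] = num_sentences - 1
    for t in range(num_frames - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels


# Toy example: 6 frames, 3 sentences, each sentence covering two frames.
toy_sim = np.zeros((6, 3))
for frame in range(6):
    toy_sim[frame, frame // 2] = 1.0
toy_labels = monotonic_pseudo_labels(toy_sim)
```

In a full pipeline, labels of this kind could serve as targets for the frame-sentence contrastive term, while `symmetric_contrastive_loss` applied to whole-video and whole-script embeddings would play the role of the video-paragraph term.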
Pages: 2437-2447
Number of pages: 11
Related papers
50 in total
  • [31] Cross-epoch learning for weakly supervised anomaly detection in surveillance videos
    Yu, Shenghao
    Wang, Chong
    Mao, Qiaomei
    Li, Yuqi
    Wu, Jiafei
    IEEE Signal Processing Letters, 2021, 28 : 2137 - 2141
  • [32] Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos
    Wu, Jie
    Li, Guanbin
    Han, Xiaoguang
    Lin, Liang
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 1283 - 1291
  • [33] Cross-Epoch Learning for Weakly Supervised Anomaly Detection in Surveillance Videos
    Yu, Shenghao
    Wang, Chong
    Mao, Qiaomei
    Li, Yuqi
    Wu, Jiafei
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 2137 - 2141
  • [34] Relabeling Abnormal Videos via Intra-Video Label Propagation for Weakly Supervised Video Anomaly Detection
    Thou, Wenhao
    Li, Yingxuan
    Zhao, Jiancheng
    Zhao, Chunhui
    2024 14TH ASIAN CONTROL CONFERENCE, ASCC 2024, 2024, : 1200 - 1205
  • [35] Weakly-supervised Temporal Path Representation Learning with Contrastive Curriculum Learning
    Yang, Sean Bin
    Guo, Chenjuan
    Hu, Jilin
    Yang, Bin
    Tang, Jian
    Jensen, Christian S.
    2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 2873 - 2885
  • [36] Detecting Fall Actions of Videos by Using Weakly-Supervised Learning and Unsupervised Clustering Learning
    Zhou, Jiaxin
    Komuro, Takashi
    ADVANCES IN VISUAL COMPUTING, ISVC 2022, PT I, 2022, 13598 : 313 - 324
  • [37] Weakly supervised object localization and segmentation in videos
    Rochan, Mrigank
    Rahman, Shafin
    Bruce, Neil D. B.
    Wang, Yang
    IMAGE AND VISION COMPUTING, 2016, 56 : 1 - 12
  • [38] Weakly Supervised Dense Event Captioning in Videos
    Duan, Xuguang
    Huang, Wenbing
    Gan, Chuang
    Wang, Jingdong
    Zhu, Wenwu
    Huang, Junzhou
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [39] Multi-representation fusion learning for weakly supervised semantic segmentation
    Li, Yongqiang
    Hu, Chuanping
    Ren, Kai
    Xi, Hao
    Fan, Jinhao
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 277
  • [40] Weakly Supervised Disentangled Representation for Goal-Conditioned Reinforcement Learning
    Qian, Zhifeng
    You, Mingyu
    Zhou, Hongjun
    He, Bin
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (02): : 2202 - 2209