Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1]
Hu, Huazhang [1]
Lian, Dongze [2]
Luo, Weixin [3]
Qian, Yicheng [1]
Gao, Shenghua [1, 4, 5]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
DOI
10.1109/CVPR52729.2023.00241
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Sequential video understanding, an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation and use a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss, in which a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. Because the frame-sentence correspondence is not available, we exploit the fact that video actions happen sequentially in time to generate pseudo frame-sentence correspondences and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
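The two-level loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes the simplest possible pseudo labeling, an even split of the frames across the ordered sentences, and a CLIP-style softmax contrastive loss over cosine similarities. All function names, the `temperature` value, and the `weight` parameter are hypothetical choices made for illustration.

```python
import math


def uniform_pseudo_labels(num_frames, num_sentences):
    """Assign each frame to a sentence index, preserving temporal order.

    Assumption: frames are split evenly across sentences. The paper derives
    pseudo labels from the sequential order of actions; an even split is only
    the simplest stand-in for that idea.
    """
    return [f * num_sentences // num_frames for f in range(num_frames)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(queries, keys, targets, temperature=0.07):
    """Mean cross-entropy of softmax(cosine / temperature) against targets."""
    total = 0.0
    for query, target in zip(queries, targets):
        logits = [cosine(query, key) / temperature for key in keys]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[target]
    return total / len(queries)


def multi_granularity_loss(frame_feats, sentence_feats,
                           video_feats, paragraph_feats, weight=1.0):
    """Coarse video-paragraph loss plus weighted frame-sentence loss."""
    # Fine-grained term: frames of one video against that video's sentences.
    labels = uniform_pseudo_labels(len(frame_feats), len(sentence_feats))
    fine = contrastive_loss(frame_feats, sentence_feats, labels)
    # Coarse term: the i-th video in the batch matches the i-th paragraph,
    # averaged over both matching directions as in CLIP.
    ids = list(range(len(video_feats)))
    coarse = 0.5 * (contrastive_loss(video_feats, paragraph_feats, ids)
                    + contrastive_loss(paragraph_feats, video_feats, ids))
    return coarse + weight * fine
```

For example, `uniform_pseudo_labels(6, 3)` assigns the first two frames to sentence 0, the next two to sentence 1, and the last two to sentence 2. In a real training setup the features would be encoder outputs and the loss would be computed with an automatic differentiation library; plain lists are used here only to keep the sketch dependency-free.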
Pages: 2437-2447
Number of pages: 11