Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1 ]
Hu, Huazhang [1 ]
Lian, Dongze [2 ]
Luo, Weixin [3 ]
Qian, Yicheng [1 ]
Gao, Shenghua [1 ,4 ,5 ]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
DOI
10.1109/CVPR52729.2023.00241
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Sequential video understanding, an emerging video understanding task, has attracted considerable research attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation, and a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss: a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As the frame-sentence correspondence is not available, we use the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences, and we supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
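The two loss granularities described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes embeddings are plain lists of floats, uses a CLIP-style symmetric cross-entropy for the video-paragraph term, and, as the simplest ordering-respecting stand-in, assigns frames to sentences by an even temporal split. All function names, the even split, and the temperature value 0.07 are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def video_paragraph_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th video matches the i-th paragraph."""
    n = len(video_embs)
    sim = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]
    video_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_video = sum(
        cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)

def uniform_pseudo_labels(num_frames, num_sentences):
    """Assumed stand-in for pseudo-label generation: split the frames evenly,
    in temporal order, among the sentences (labels never decrease)."""
    return [f * num_sentences // num_frames for f in range(num_frames)]

def frame_sentence_loss(frame_embs, sentence_embs, labels, temperature=0.07):
    """Cross-entropy of each frame against all sentences, using pseudo labels."""
    total = 0.0
    for frame, label in zip(frame_embs, labels):
        logits = [cosine(frame, s) / temperature for s in sentence_embs]
        total += cross_entropy(logits, label)
    return total / len(frame_embs)

# Usage: six frames and three ordered sentences.
print(uniform_pseudo_labels(6, 3))  # [0, 0, 1, 1, 2, 2]
```

The even split only encodes the ordering constraint. A training pipeline would normally refine the assignment using the current frame-sentence similarities while keeping the labels monotonic in time.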
Pages: 2437-2447
Number of pages: 11
Related papers
50 in total
  • [21] Deep Text Prior: Weakly Supervised Learning for Assertion Classification
    Liventsev, Vadim
    Fedulova, Irina
    Dylov, Dmitry
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2019: WORKSHOP AND SPECIAL SESSIONS, 2019, 11731 : 243 - 257
  • [22] Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
    Lialin, Vladislav
    Rawls, Stephen
    Chan, David
    Ghosh, Shalini
    Rumshisky, Anna
    Hamza, Wael
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW), 2023, : 390 - 400
  • [23] Weakly supervised graph learning for action recognition in untrimmed video
    Yao, Xiao
    Zhang, Jia
    Chen, Ruixuan
    Zhang, Dan
    Zeng, Yifeng
    VISUAL COMPUTER, 2023, 39 (11): : 5469 - 5483
  • [24] NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning
    Richard, Alexander
    Kuehne, Hilde
    Iqbal, Ahsan
    Gall, Juergen
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7386 - 7395
  • [25] Weakly Supervised Semantic Segmentation Learning on UAV Video Sequences
    Blaga, Bianca-Cerasela-Zelia
    Nedevschi, Sergiu
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 731 - 735
  • [26] Mining relational data from text: From strictly supervised to weakly supervised learning
    Zhang, Zhu
    INFORMATION SYSTEMS, 2008, 33 (03) : 300 - 314
  • [27] Weakly supervised graph learning for action recognition in untrimmed video
    Xiao Yao
    Jia Zhang
    Ruixuan Chen
    Dan Zhang
    Yifeng Zeng
    The Visual Computer, 2023, 39 : 5469 - 5483
  • [28] Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos
    Koller, Oscar
    Camgoz, Necati Cihan
    Ney, Hermann
    Bowden, Richard
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (09) : 2306 - 2320
  • [29] SELF-SUPERVISED REPRESENTATION LEARNING FOR ULTRASOUND VIDEO
    Jiao, Jianbo
    Droste, Richard
    Drukker, Lior
    Papageorghiou, Aris T.
    Noble, J. Alison
    2020 IEEE 17TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2020), 2020, : 1847 - 1850
  • [30] Self-Supervised Video Representation Learning by Video Incoherence Detection
    Cao, Haozhi
    Xu, Yuecong
    Mao, Kezhi
    Xie, Lihua
    Yin, Jianxiong
    See, Simon
    Xu, Qianwen
    Yang, Jianfei
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (06) : 3810 - 3822