Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1]
Hu, Huazhang [1]
Lian, Dongze [2]
Luo, Weixin [3]
Qian, Yicheng [1]
Gao, Shenghua [1,4,5]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Funding
National Key Research and Development Program of China;
DOI
10.1109/CVPR52729.2023.00241
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Sequential video understanding, an emerging video understanding task, has attracted considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features into a video representation and a pre-trained text encoder to encode the text corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss: a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. Because the frame-sentence correspondence is not available, we exploit the fact that video actions occur sequentially in the temporal domain to generate pseudo frame-sentence correspondence, and we supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
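The two loss terms described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the general shape of a CLIP-style video-paragraph contrastive loss and a frame-sentence loss trained against pseudo labels. All function names, the temperature value of 0.07, and in particular the even, in-order split used to generate pseudo labels are assumptions made for illustration; the abstract states only that sequential ordering is used, not how.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def video_paragraph_loss(video_emb, paragraph_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch.

    Row i of video_emb and row i of paragraph_emb are treated as a matched
    pair; every other row in the batch acts as a negative.
    """
    v = l2_normalize(video_emb)
    p = l2_normalize(paragraph_emb)
    logits = v @ p.T / temperature          # shape (B, B)
    idx = np.arange(len(v))
    loss_v2p = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_p2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2p + loss_p2v)


def uniform_pseudo_labels(num_frames, num_sentences):
    """Hypothetical pseudo-label rule: split frames evenly, in temporal order.

    Frame t is assigned to sentence floor(t * K / T), which is non-decreasing
    in t and therefore respects the sequential order of the actions.
    """
    return (np.arange(num_frames) * num_sentences) // num_frames


def frame_sentence_loss(frame_emb, sentence_emb, temperature=0.07):
    """Cross-entropy of each frame against its pseudo-assigned sentence."""
    f = l2_normalize(frame_emb)
    s = l2_normalize(sentence_emb)
    logits = f @ s.T / temperature          # shape (T, K)
    labels = uniform_pseudo_labels(len(f), len(s))
    return -log_softmax(logits, axis=1)[np.arange(len(f)), labels].mean()


def multi_granularity_loss(video_emb, paragraph_emb, frame_emb, sentence_emb,
                           fine_weight=1.0):
    """Coarse (video-paragraph) plus weighted fine (frame-sentence) term."""
    coarse = video_paragraph_loss(video_emb, paragraph_emb)
    fine = frame_sentence_loss(frame_emb, sentence_emb)
    return coarse + fine_weight * fine
```

In this sketch the coarse term operates on a batch of whole-video and whole-script embeddings, while the fine term operates on the frames and sentences of a single video. The weighting between the two terms (`fine_weight`) is likewise a placeholder and not a value taken from the paper.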
Pages: 2437-2447
Number of pages: 11