Action-guided prompt tuning for video grounding

Cited: 0
Authors
Wang, Jing [1 ]
Tsao, Raymon [2 ]
Wang, Xuan [1 ]
Wang, Xiaojie [1 ]
Feng, Fangxiang [1 ]
Tian, Shiyu [1 ]
Poria, Soujanya [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information;
DOI
10.1016/j.inffus.2024.102577
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video grounding aims to locate a moment of interest that semantically corresponds to a given query. We argue that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human process of perceiving events. Specifically, previous studies forcibly map the video and language modalities into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. In the video modality, however, actions, especially salient ones, typically unfold over many frames and thus carry a rich reservoir of informative detail, whereas at the query level a verb is constrained to a single-word representation. This discrepancy reveals a significant sparsity in language features, making it suboptimal to naively map the two modalities into a shared space. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory, yet preceding methods fail to account for this essential perceptual process. Considering these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module that explores latent expansion information for salient verbs, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we introduce action temporal prediction as an auxiliary task for video grounding, together with a temporal rank loss function that simulates the human perceptual system's segmentation of events, rendering AGPT temporal-aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
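
The Prompt Exploration idea lends itself to a short illustration. The sketch below is a minimal PyTorch-style rendering of how a single-token verb embedding might be expanded into several learnable prompt tokens to densify the language features before cross-modal matching; the module name, the conditioning scheme, and all shapes are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: expand a salient verb's single token embedding into
# several learnable prompt vectors, densifying the query representation.
# All names and design choices here are illustrative assumptions.
import torch
import torch.nn as nn

class VerbPromptExpansion(nn.Module):
    def __init__(self, dim: int, n_prompts: int = 4):
        super().__init__()
        # Learnable prompt tokens shared across queries (assumption).
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.fuse = nn.Linear(dim * 2, dim)

    def forward(self, query_emb: torch.Tensor, verb_idx: int) -> torch.Tensor:
        # query_emb: (seq_len, dim) token embeddings of the query.
        verb = query_emb[verb_idx]                                   # (dim,)
        # Condition each shared prompt on the verb to get verb-specific tokens.
        expanded = self.fuse(torch.cat(
            [self.prompts, verb.expand_as(self.prompts)], dim=-1))  # (n_prompts, dim)
        # Splice the expanded tokens in right after the verb position.
        return torch.cat([query_emb[:verb_idx + 1], expanded,
                          query_emb[verb_idx + 1:]], dim=0)

# Usage: densify a 10-token query whose verb sits at position 2.
mod = VerbPromptExpansion(dim=256)
out = mod(torch.randn(10, 256), verb_idx=2)   # -> shape (14, 256)
```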
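
Likewise, the temporal rank loss can be illustrated with one plausible form: a pairwise hinge that pushes per-frame action scores inside the annotated moment above those outside it, loosely mimicking how humans segment ongoing activity into discrete events. The function name, margin value, and pairwise formulation are assumptions; the paper's exact loss may differ.

```python
# Hypothetical sketch of a temporal rank loss: frames inside the annotated
# moment should out-score frames outside it by a margin. This is one plausible
# form, not the paper's confirmed formulation.
import torch

def temporal_rank_loss(frame_scores: torch.Tensor, start: int, end: int,
                       margin: float = 0.2) -> torch.Tensor:
    # frame_scores: (T,) per-frame action scores; [start, end) is the moment.
    inside = frame_scores[start:end]
    outside = torch.cat([frame_scores[:start], frame_scores[end:]])
    if inside.numel() == 0 or outside.numel() == 0:
        return frame_scores.sum() * 0.0       # degenerate clip: zero loss
    # Pairwise hinge: every inside score should beat every outside score.
    diff = margin - (inside[:, None] - outside[None, :])
    return torch.clamp(diff, min=0.0).mean()

loss = temporal_rank_loss(torch.randn(32), start=8, end=20)
```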
Pages: 10
Related papers
50 items in total
  • [11] Video-Guided Curriculum Learning for Spoken Video Grounding
    Xia, Yan
    Zhao, Zhou
    Ye, Shangwei
    Zhao, Yang
    Li, Haoyuan
    Ren, Yi
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 5191 - 5200
  • [12] Choice Coding in Frontal Cortex during Stimulus-Guided or Action-Guided Decision-Making
    Luk, Chung-Hay
    Wallis, Jonathan D.
    JOURNAL OF NEUROSCIENCE, 2013, 33 (05): 1864 - 1871
  • [13] Learning Action-guided Spatio-temporal Transformer for Group Activity Recognition
    Li, Wei
    Yang, Tianzhao
    Wu, Xiao
    Du, Xian-Jun
    Qiao, Jian-Jun
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 2051 - 2060
  • [14] Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation
    Jin, Peng
    Li, Hao
    Cheng, Zesen
    Li, Kehan
    Yu, Runyi
    Liu, Chang
    Ji, Xiangyang
    Yuan, Li
    Chen, Jie
    COMPUTER VISION - ECCV 2024, PT XXV, 2025, 15083 : 392 - 409
  • [15] Towards Visual-Prompt Temporal Answer Grounding in Instructional Video
    Li, Shutao
    Li, Bin
    Sun, Bin
    Weng, Yixuan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 8836 - 8853
  • [16] Temporally Language Grounding With Multi-Modal Multi-Prompt Tuning
    Zeng, Yawen
    Han, Ning
    Pan, Keyu
    Jin, Qin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3366 - 3377
  • [17] ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization
    Wang, Hao
    Liu, Fang
    Jiao, Licheng
    Wang, Jiahao
    Hao, Zehua
    Li, Shuo
    Li, Lingling
    Chen, Puhua
    Liu, Xu
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024: 5390 - 5400
  • [18] Modular Action Concept Grounding in Semantic Video Prediction
    Yu, Wei
    Chen, Wenxin
    Yin, Songheng
    Easterbrook, Steve
    Garg, Animesh
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 3595 - 3604
  • [19] ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting
    Zheng, Hongwei
    Li, Han
    Shi, Bowen
    Dai, Wenrui
    Wang, Botao
    Sun, Yu
    Guo, Min
    Xiong, Hongkai
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 2657 - 2662
  • [20] BIGaze: An eye-gaze action-guided Bayesian information gain framework for information exploration
    Lee, Seung Won
    Kim, Hwan
    Yi, Taeha
    Hyun, Kyung Hoon
    ADVANCED ENGINEERING INFORMATICS, 2023, 58