Action-guided prompt tuning for video grounding

Cited: 0
|
Authors
Wang, Jing [1 ]
Tsao, Raymon [2 ]
Wang, Xuan [1 ]
Wang, Xiaojie [1 ]
Feng, Fangxiang [1 ]
Tian, Shiyu [1 ]
Poria, Soujanya [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information
DOI
10.1016/j.inffus.2024.102577
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Video grounding aims to locate a moment of interest that semantically corresponds to a given query. We claim that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies forcibly map the video modality and the language modality into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. In the video modality, however, actions, especially salient ones, typically manifest across many frames, encompassing a richer reservoir of informative details, whereas at the query level a verb is constrained to a single-word representation. This discrepancy highlights a significant sparsity in language features, making it suboptimal to naively map the two modalities into a shared space. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory; preceding methods fail to account for this essential perception process. Considering these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module to explore latent expansion information of salient verbs in the language, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we design the auxiliary task of action temporal prediction for video grounding and introduce a temporal rank loss function to simulate the human perceptual system's segmentation of events, rendering AGPT temporal-aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
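The abstract does not specify the exact form of the temporal rank loss. As a rough illustration of the general idea only (the function name, margin parameter, and pairwise hinge formulation below are all hypothetical, not taken from the paper), a rank loss that penalizes predicted event timestamps falling out of chronological order might be sketched as:

```python
def temporal_rank_loss(pred_times, margin=0.1):
    """Hinge-style rank loss over adjacent predicted timestamps.

    pred_times: predicted occurrence times of events, listed in
    ground-truth chronological order. Each adjacent pair is penalized
    when the later event is predicted earlier than (or within `margin`
    of) the earlier one, encouraging order-preserving predictions.
    """
    pairs = zip(pred_times[:-1], pred_times[1:])
    penalties = [max(0.0, margin - (later - earlier))
                 for earlier, later in pairs]
    return sum(penalties) / len(penalties)
```

For example, correctly ordered, well-separated predictions such as `[0.1, 0.5, 0.9]` incur zero loss, while a reversed pair such as `[0.5, 0.1]` is penalized in proportion to how badly the order is violated.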
Pages: 10
Related Papers
50 records
  • [31] Visual Prompt Tuning
    Jia, Menglin
    Tang, Luming
    Chen, Bor-Chun
    Cardie, Claire
    Belongie, Serge
    Hariharan, Bharath
    Lim, Ser-Nam
    COMPUTER VISION - ECCV 2022, PT XXXIII, 2022, 13693 : 709 - 727
  • [32] Prompt-aligned Gradient for Prompt Tuning
    Zhu, Beier
    Niu, Yulei
    Han, Yucheng
    Wu, Yue
    Zhang, Hanwang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15613 - 15623
  • [33] ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding
    Wang, Lan
    Mittal, Gaurav
    Sajeev, Sandra
    Yu, Ye
    Hall, Matthew
    Boddeti, Vishnu Naresh
    Chen, Mei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6575 - 6585
  • [34] UMP: Unified Modality-Aware Prompt Tuning for Text-Video Retrieval
    Zhang, Haonan
    Zeng, Pengpeng
    Gao, Lianli
    Song, Jingkuan
    Shen, Heng Tao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 11954 - 11964
  • [35] Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
    Tan, Hao
    Li, Jun
    Zhou, Yizhuang
    Wan, Jun
    Lei, Zhen
    Zhang, Xiangyu
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 5061 - 5069
  • [36] Dual Context-Guided Continuous Prompt Tuning for Few-Shot Learning
    Zhou, Jie
    Tian, Lei
    Yu, Houjin
    Zhou, Xiao
    Su, Hui
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 79 - 84
  • [37] DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
    Yang, Xiangpeng
    Zhu, Linchao
    Wang, Xiaohan
    Yang, Yi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 6540 - 6548
  • [38] From Video Matching to Video Grounding
    Evangelidis, Georgios
    Diego, Ferran
    Horaud, Radu
    2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2013, : 608 - 615
  • [39] Safety lapses prompt leisure air grounding
    Aviation Week and Space Technology (New York), 1994, 141 (22):
  • [40] SAFETY LAPSES PROMPT LEISURE AIR GROUNDING
[Anonymous]
    AVIATION WEEK & SPACE TECHNOLOGY, 1994, 141 (22): : 32 - 33