Action-guided prompt tuning for video grounding

Cited: 0
Authors
Wang, Jing [1 ]
Tsao, Raymon [2 ]
Wang, Xuan [1 ]
Wang, Xiaojie [1 ]
Feng, Fangxiang [1 ]
Tian, Shiyu [1 ]
Poria, Soujanya [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information;
DOI
10.1016/j.inffus.2024.102577
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video grounding aims to locate a moment of interest that semantically corresponds to a given query. We claim that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human perception process of events. Specifically, previous studies forcibly map the video modality and the language modality into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. In the video modality, however, actions, especially salient ones, typically unfold over many frames and thus carry a richer reservoir of informative details, whereas at the query level each verb is constrained to a single-word representation. This discrepancy reveals a significant sparsity in language features, making it suboptimal to naively map the two modalities into a shared space. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory; preceding methods fail to account for this essential perception process. To address these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module that explores latent expansion information for salient verbs in language, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we introduce the auxiliary task of action temporal prediction for video grounding, together with a temporal rank loss function that simulates the human perceptual system's segmentation of events, rendering AGPT temporal-aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
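The abstract does not spell out the temporal rank loss, so as a purely illustrative sketch (the function name, inputs, and margin formulation below are assumptions, not the paper's actual definition), a generic pairwise margin-based ranking loss over predicted temporal scores could look like this: each sampled action segment gets a scalar score, and segments that occur earlier in the ground-truth order are penalized unless their score is lower than later segments' by at least a margin.

```python
def temporal_rank_loss(scores, order, margin=0.1):
    """Illustrative pairwise margin ranking loss (not the paper's exact loss).

    scores: predicted temporal scores, one per action segment.
    order:  ground-truth temporal positions (smaller = occurs earlier).
    For every pair where segment i precedes segment j, we apply a hinge
    penalty unless scores[j] exceeds scores[i] by at least `margin`.
    Returns the mean penalty over all ordered pairs.
    """
    losses = []
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if order[i] < order[j]:  # i should be ranked before j
                losses.append(max(0.0, margin - (scores[j] - scores[i])))
    return sum(losses) / len(losses) if losses else 0.0


# Correctly ordered predictions with a wide gap incur zero loss;
# reversed predictions are penalized.
loss_good = temporal_rank_loss([0.0, 1.0, 2.0], [0, 1, 2])
loss_bad = temporal_rank_loss([2.0, 1.0, 0.0], [0, 1, 2])
```

In practice such a loss would be computed on differentiable score tensors inside the training loop; the plain-Python version here only conveys the ordering constraint the abstract describes.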
Pages: 10
Related Papers
50 records
  • [1] SPTNET: Span-based Prompt Tuning for Video Grounding
    Zhang, Yiren
    Xu, Yuanwu
    Chen, Mohan
    Zhang, Yuejie
    Feng, Rui
    Gao, Shang
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2807 - 2812
  • [2] Action-guided CycleGAN for Bi-directional Video Prediction
    Verma, Amit
    Meenpal, Toshanlal
    Acharya, Bibhudendra
    IETE TECHNICAL REVIEW, 2024, 41 (05) : 522 - 536
  • [3] ACTOR: Action-Guided Kernel Fuzzing
    Fleischer, Marius
    Das, Dipanjan
    Bose, Priyanka
    Bai, Weiheng
    Lu, Kangjie
    Payer, Mathias
    Kruegel, Christopher
    Vigna, Giovanni
    PROCEEDINGS OF THE 32ND USENIX SECURITY SYMPOSIUM, 2023, : 5003 - 5020
  • [4] On the Approach of the Action-Guided Teaching in Application
    Qian Zhiwang
    2013 INTERNATIONAL CONFERENCE ON APPLIED SOCIAL SCIENCE (ICASS 2013), VOL 4, 2013, : 304 - 311
  • [5] Point Prompt Tuning for Temporally Language Grounding
    Zeng, Yawen
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 2003 - 2007
  • [6] EVA: Enabling Video Attributes With Hierarchical Prompt Tuning for Action Recognition
    Ruan, Xiangning
    Yin, Qixiang
    Su, Fei
    Zhao, Zhicheng
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 971 - 975
  • [7] Compressed Video Prompt Tuning
    Li, Bing
    Chen, Jiaxin
    Bao, Xiuguo
    Huang, Di
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] Action-guided 3D Human Motion Prediction
    Sun, Jiangxin
    Lin, Zihang
    Han, Xintong
    Hu, Jian-Fang
    Xu, Jia
    Zheng, Wei-Shi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [9] Partially dissociable roles of OFC and ACC in stimulus-guided and action-guided decision making
    Khani, Abbas
    JOURNAL OF NEUROPHYSIOLOGY, 2014, 111 (09) : 1717 - 1720
  • [10] Boost Tracking by Natural Language With Prompt-Guided Grounding
    Li, Hengyou
    Liu, Xinyan
    Li, Guorong
    Wang, Shuhui
    Qing, Laiyun
    Huang, Qingming
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2025, 26 (01) : 1088 - 1100