Action-guided prompt tuning for video grounding

Cited by: 0
Authors
Wang, Jing [1 ]
Tsao, Raymon [2 ]
Wang, Xuan [1 ]
Wang, Xiaojie [1 ]
Feng, Fangxiang [1 ]
Tian, Shiyu [1 ]
Poria, Soujanya [3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China
[2] Peking Univ, 5 Yiheyuan Rd, Beijing 100871, Peoples R China
[3] Singapore Univ Technol & Design, Sch Informat Syst Technol & Design, 8 Somapah Rd, Singapore 487372, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Video grounding; Multi-modal learning; Prompt tuning; Temporal information
DOI
10.1016/j.inffus.2024.102577
CLC classification
TP18 [Theory of artificial intelligence];
Subject classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video grounding aims to locate a moment of interest that semantically corresponds to a given query. We argue that existing methods overlook two critical issues: (1) the sparsity of language, and (2) the human process of perceiving events. Specifically, previous studies forcibly map the video and language modalities into a joint space for alignment, disregarding their inherent disparities. Verbs play a crucial role in queries, providing discriminative information for distinguishing different videos. In the video modality, however, actions, especially salient ones, typically span many frames and therefore carry a rich reservoir of informative detail, whereas at the query level a verb is constrained to a single word representation. This discrepancy reveals a significant sparsity in language features, which makes naively mapping the two modalities into a shared space suboptimal. Furthermore, segmenting ongoing activity into meaningful events is integral to human perception and contributes to event memory; preceding methods fail to account for this essential perceptual process. To address these issues, we propose a novel Action-Guided Prompt Tuning (AGPT) method for video grounding. First, we design a Prompt Exploration module that explores latent expansion information for salient verbs in the query, thereby reducing language feature sparsity and facilitating cross-modal matching. Second, we design action temporal prediction as an auxiliary task for video grounding and introduce a temporal rank loss that simulates the human perceptual system's segmentation of events, making AGPT temporal-aware. Our approach can be seamlessly integrated into any video grounding model with minimal additional parameters. Extensive ablation experiments on three backbones and three datasets demonstrate the superiority of our method.
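The abstract does not give the exact form of the temporal rank loss. A minimal sketch of one plausible margin-based formulation, assuming predicted normalized timestamps (`pred_times`, a hypothetical name) for actions known to occur in chronological order; the actual loss in the paper may differ:

```python
import torch
import torch.nn.functional as F

def temporal_rank_loss(pred_times: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Hypothetical pairwise margin rank loss over predicted action times.

    pred_times: shape (N,), predicted normalized timestamps for N actions
    whose ground-truth chronological order is 0, 1, ..., N-1.
    Penalizes any adjacent pair whose predicted order violates the
    ground-truth order by less than `margin` (hinge on the gap).
    """
    earlier = pred_times[:-1]  # predicted time of action i
    later = pred_times[1:]     # predicted time of action i + 1
    # relu(earlier - later + margin) is zero only when action i is
    # predicted at least `margin` before action i + 1.
    return F.relu(earlier - later + margin).mean()
```

A correctly ordered, well-separated sequence such as `[0.1, 0.4, 0.9]` yields zero loss, while a swapped pair such as `[0.9, 0.1]` is penalized in proportion to the violation.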
Pages: 10