Learning Feature Semantic Matching for Spatio-Temporal Video Grounding

Cited by: 0
Authors
Zhang, Tong [1]
Fang, Hao [1, 2]
Zhang, Hao [3]
Gao, Jialin [3]
Lu, Xiankai [1]
Nie, Xiushan [4, 5]
Yin, Yilong [1]
Affiliations
[1] Shandong Univ, Sch Software, Jinan 250101, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
[4] Shandong Yunhai Guochuang Cloud Comp Equipment Ind, Jinan 250101, Peoples R China
[5] Shandong Jianzhu Univ, Sch Comp Sci & Technol, Jinan 250014, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Grounding; Feature extraction; Transformers; Task analysis; Electron tubes; Decoding; Semantics; Spatio-temporal video grounding; multi-modal attention; contrastive loss;
DOI
10.1109/TMM.2024.3387696
Chinese Library Classification
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube, comprising temporal boundaries and object bounding boxes, that semantically corresponds to a given language description in an untrimmed video. Existing one-stage solutions to this task face two significant challenges, namely vision-text semantic misalignment and spatial mislocalization, which limit their grounding performance. These two limitations are mainly caused by the neglect of fine-grained alignment in cross-modality fusion and the reliance on a text-agnostic query in sequential spatial localization. To address these issues, we propose an effective model with a newly designed Feature Semantic Matching (FSM) module based on a Transformer architecture. Our method introduces a cross-modal feature matching module to achieve multi-granularity alignment between video and text while preventing the weakening of important features during the feature fusion stage. Additionally, we design a query-modulated matching module to facilitate text-relevant tube construction through multiple query generation and tubelet sequence matching. To ensure the quality of tube construction, we employ a novel mismatching rectify contrastive loss that rectifies the mismatch between the learnable query and the objects corresponding to the text descriptions by restricting the generated spatial query. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two challenging STVG benchmarks.
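The abstract does not specify the form of the mismatching rectify contrastive loss. As a rough illustration of the general idea of a contrastive objective that pulls a query toward its text-matched object and away from others, the sketch below implements a generic InfoNCE-style loss. It is not the authors' formulation: the function names `cosine` and `info_nce`, the cosine similarity, and the `temperature` value are all assumptions made for this example.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Generic InfoNCE loss: negative log-softmax of the positive candidate.

    `query` stands in for a spatial query feature, `positive` for the feature
    of the object described by the text, and `negatives` for other objects.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]


# Hypothetical toy features: the query is close to the described object.
query = [1.0, 0.0]
described_object = [0.9, 0.1]
other_objects = [[0.0, 1.0], [-1.0, 0.0]]

matched = info_nce(query, described_object, other_objects)
mismatched = info_nce(query, other_objects[0], [described_object, other_objects[1]])
print(matched < mismatched)
```

In this toy setup the loss is lower when the query is paired with the described object than when it is paired with an unrelated one, which is the behavior a contrastive term of this kind is meant to encourage.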
Pages: 9268 - 9279
Page count: 12
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3958 - 3962