HiSA: Hierarchically Semantic Associating for Video Temporal Grounding

Cited by: 11
Authors
Xu, Zhe [1]
Chen, Da [2]
Wei, Kun [1]
Deng, Cheng [1]
Xue, Hui [2]
Affiliations
[1] Xidian Univ, Sch Elect Engn, Xian 710071, Peoples R China
[2] Alibaba Grp, Hangzhou 311121, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Grounding; Feature extraction; Proposals; Task analysis; Semantics; Representation learning; Image segmentation; Video temporal grounding; feature disentanglement; cross-guided contrast; LANGUAGE; LOCALIZATION; IMAGE;
DOI
10.1109/TIP.2022.3191841
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Temporal Grounding (VTG) aims to locate the time interval in a video that is semantically relevant to a language query. Existing VTG methods let the query interact with entangled video features and treat the instances in a dataset independently. However, intra-video entanglement and inter-video connection are rarely considered in these methods, leading to mismatches between the video and language. To this end, we propose a novel method, dubbed Hierarchically Semantic Associating (HiSA), which aims to precisely align the video with language and obtain discriminative representations for further location regression. Specifically, the action factors and background factors are disentangled from adjacent video segments, enforcing precise multimodal interaction and alleviating the intra-video entanglement. In addition, cross-guided contrast is elaborately framed to capture the inter-video connection, which benefits the multimodal understanding needed to locate the time interval. Extensive experiments on three benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods. The project page is available at: https://github.com/zhexu1997/HiSA.
Pages: 5178-5188
Page count: 11
Related papers
50 in total
  • [1] Unsupervised Temporal Video Grounding with Deep Semantic Clustering
    Liu, Daizong
    Qu, Xiaoye
    Wang, Yinzhen
    Di, Xing
    Zou, Kai
    Cheng, Yu
    Xu, Zichuan
    Zhou, Pan
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1683 - 1691
  • [2] ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding
    Wang, Lan
    Mittal, Gaurav
    Sajeev, Sandra
    Yu, Ye
    Hall, Matthew
    Boddeti, Vishnu Naresh
    Chen, Mei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6575 - 6585
  • [3] Learning Feature Semantic Matching for Spatio-Temporal Video Grounding
    Zhang, Tong
    Fang, Hao
    Zhang, Hao
    Gao, Jialin
    Lu, Xiankai
    Nie, Xiushan
    Yin, Yilong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9268 - 9279
  • [4] Efficient Spatio-Temporal Video Grounding with Semantic-Guided Feature Decomposition
    Wang, Weikang
    Liu, Jing
    Su, Yuting
    Nie, Weizhi
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4867 - 4876
  • [5] An empirical study of the effect of video encoders on Temporal Video Grounding
    De la Jara, Ignacio M.
    Rodriguez-Opazo, Cristian
    Marrese-Taylor, Edison
    Bravo-Marquez, Felipe
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2842 - 2847
  • [6] Hierarchically Supervised Deconvolutional Network for Semantic Video Segmentation
    Wang, Yuhang
    Liu, Jing
    Li, Yong
    Fu, Jun
    Xu, Min
    Lu, Hanqing
    PATTERN RECOGNITION, 2017, 64 : 437 - 445
  • [7] Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding
    Tan, Chaolei
    Lin, Zihang
    Hu, Jian-Fang
    Zheng, Wei-Shi
    Lai, Jianhuang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18973 - 18982
  • [8] Modular Action Concept Grounding in Semantic Video Prediction
    Yu, Wei
    Chen, Wenxin
    Yin, Songheng
    Easterbrook, Steve
    Garg, Animesh
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 3595 - 3604
  • [9] Point-Supervised Video Temporal Grounding
    Xu, Zhe
    Wei, Kun
    Yang, Xu
    Deng, Cheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 6121 - 6131
  • [10] SDN: Semantic Decoupling Network for Temporal Language Grounding
    Jiang, Xun
    Xu, Xing
    Zhang, Jingran
    Shen, Fumin
    Cao, Zuo
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (05) : 6598 - 6612