Boosting Video-Text Retrieval with Explicit High-Level Semantics

Times Cited: 6
Authors
Wang, Haoran [1 ]
Xu, Di [2 ]
He, Dongliang [1 ]
Li, Fu [1 ]
Ji, Zhong [3 ]
Han, Jungong [4 ]
Ding, Errui [1 ]
Affiliations
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
Keywords
Video-Text Retrieval; High-level Semantics; Vision-language Understanding
DOI
10.1145/3503161.3548010
CLC Classification
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for the relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic alignment model named HiSE for VTR, which improves cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD and DiDeMo.
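The abstract describes the HiSE pipeline only at a high level; the following is a minimal, illustrative sketch of that description in PyTorch, not the authors' implementation. All class and argument names (GraphReasoning, HiSESketch, entity_ids, caption_ids, occurrence_ids, action_ids) and the dimensions are assumptions, and the off-the-shelf entity predictor, video captioning model and text parser the paper relies on are represented only by the token ids they would output.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoning(nn.Module):
    # Generic self-attention-style message passing over semantic nodes,
    # standing in for the paper's graph reasoning modules (assumption).
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes):                          # nodes: (N, dim)
        attn = F.softmax(nodes @ nodes.t() / nodes.size(-1) ** 0.5, dim=-1)
        return nodes + self.proj(attn @ nodes)         # residual message passing

class HiSESketch(nn.Module):
    def __init__(self, dim=512, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)     # shared semantic-token embedding (assumption)
        self.video_graph = GraphReasoning(dim)
        self.text_graph = GraphReasoning(dim)

    def encode_video(self, entity_ids, caption_ids):
        # entity_ids: discrete high-level semantics from an off-the-shelf entity predictor
        # caption_ids: holistic high-level semantics from a trained video captioning model
        nodes = self.embed(torch.cat([entity_ids, caption_ids]))
        return F.normalize(self.video_graph(nodes).mean(dim=0), dim=0)

    def encode_text(self, occurrence_ids, action_ids, entity_ids):
        # occurrence ~ holistic semantics; action and entity ~ discrete semantics
        nodes = self.embed(torch.cat([occurrence_ids, action_ids, entity_ids]))
        return F.normalize(self.text_graph(nodes).mean(dim=0), dim=0)

# Toy usage with random token ids; real ids would come from the upstream predictors/parser.
model = HiSESketch()
v = model.encode_video(torch.randint(0, 1000, (5,)), torch.randint(0, 1000, (12,)))
t = model.encode_text(torch.randint(0, 1000, (8,)), torch.randint(0, 1000, (3,)), torch.randint(0, 1000, (4,)))
similarity = torch.dot(v, t)                           # cosine similarity used to rank videos against the text query

The sketch pools all semantic nodes into a single vector per modality for brevity; the paper instead keeps holistic and discrete semantics distinct and applies different graph reasoning techniques before matching.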
Pages: 4887-4898
Number of pages: 12
Related Papers
50 records in total
  • [21] Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
    Hao, Xiaoshuai
    Zhang, Wanqian
    Wu, Dayan
    Zhu, Fei
    Li, Bo
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18962 - 18972
  • [22] Complementarity-Aware Space Learning for Video-Text Retrieval
    Zhu, Jinkuan
    Zeng, Pengpeng
    Gao, Lianli
    Li, Gongfu
    Liao, Dongliang
    Song, Jingkuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4362 - 4374
  • [23] Uncertainty-Aware with Negative Samples for Video-Text Retrieval
    Song, Weitao
    Chen, Weiran
    Xu, Jialiang
    Ji, Yi
    Li, Ying
    Liu, Chunping
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 318 - 332
  • [24] Coarse-to-fine dual-level attention for video-text cross modal retrieval
    Jin, Ming
    Zhang, Huaxiang
    Zhu, Lei
    Sun, Jiande
    Liu, Li
    KNOWLEDGE-BASED SYSTEMS, 2022, 242
  • [25] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Zhou, Tongqing
    Zhao, Shan
    Cai, Zhiping
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497
  • [26] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
    Liu, Song
    Fan, Haoqi
    Qian, Shengsheng
    Chen, Yiru
    Ding, Wenkui
    Wang, Zhongyuan
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11895 - 11905
  • [27] Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval
    Lai, Huakai
    Yang, Wenfei
    Zhang, Tianzhu
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 12019 - 12031
  • [28] Video-Text Pre-training with Learned Regions for Retrieval
    Yan, Rui
    Shou, Mike Zheng
    Ge, Yixiao
    Wang, Jinpeng
    Lin, Xudong
    Cai, Guanyu
    Tang, Jinhui
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3100 - 3108
  • [29] Robust Video-Text Retrieval Via Noisy Pair Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8632 - 8645
  • [30] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58