Boosting Video-Text Retrieval with Explicit High-Level Semantics

被引:6
|
作者
Wang, Haoran [1 ]
Xu, Di [2 ]
He, Dongliang [1 ]
Li, Fu [1 ]
Ji, Zhong [3 ]
Han, Jungong [4 ]
Ding, Errui [1 ]
机构
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
关键词
Video-Text Retrieval; High-level Semantics; Vision-language Understanding;
D O I
10.1145/3503161.3548010
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for relevant video (text) given a query (video). Existing methods typically employ completely heterogeneous visual-textual information to align video and text, whilst lacking the awareness of homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics, and further decompose it into two levels, i.e. discrete semantics and holistic semantics. Specifically, for visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual modality, we parse the text into three parts including occurrence, action and entity. In particular, the occurrence corresponds to the holistic high-level semantics, meanwhile both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves the superior performance over state-of-the-art methods on three benchmark datasets, including MSR-VTT, MSVD and DiDeMo.
引用
收藏
页码:4887 / 4898
页数:12
相关论文
共 50 条
  • [11] Joint embeddings with multimodal cues for video-text retrieval
    Niluthpol C. Mithun
    Juncheng Li
    Florian Metze
    Amit K. Roy-Chowdhury
    International Journal of Multimedia Information Retrieval, 2019, 8 : 3 - 18
  • [12] An Efficient Multimodal Aggregation Network for Video-Text Retrieval
    Liu, Zhi
    Zhao, Fangyuan
    Zhang, Mengmeng
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (10) : 1825 - 1828
  • [13] Joint embeddings with multimodal cues for video-text retrieval
    Mithun, Niluthpol C.
    Li, Juncheng
    Metze, Florian
    Roy-Chowdhury, Amit K.
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 3 - 18
  • [14] Exploiting Visual Semantic Reasoning for Video-Text Retrieval
    Feng, Zerun
    Zeng, Zhimin
    Guo, Caili
    Li, Zheng
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1005 - 1011
  • [15] Animating Images to Transfer CLIP for Video-Text Retrieval
    Liu, Yu
    Chen, Huai
    Huang, Lianghua
    Chen, Di
    Wang, Bin
    Pan, Pan
    Wang, Lisheng
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 1906 - 1911
  • [16] VTC: Improving Video-Text Retrieval with User Comments
    Hanu, Laura
    Thewlis, James
    Asano, Yuki M.
    Rupprecht, Christian
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 616 - 633
  • [17] Bridging Video-text Retrieval with Multiple Choice Questions
    Ge, Yuying
    Ge, Yixiao
    Liu, Xihui
    Li, Dian
    Shan, Ying
    Qie, Xiaohu
    Luo, Ping
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16146 - 16155
  • [18] Survey on Video-Text Cross-Modal Retrieval
    Chen, Lei
    Xi, Yimeng
    Liu, Libo
    Computer Engineering and Applications, 2024, 60 (04) : 1 - 20
  • [19] HANet: Hierarchical Alignment Networks for Video-Text Retrieval
    Wu, Peng
    He, Xiangteng
    Tang, Mingqian
    Lv, Yiliang
    Liu, Jing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3518 - 3527
  • [20] Adaptive Token Excitation with Negative Selection for Video-Text Retrieval
    Yu, Juntao
    Ni, Zhangkai
    Su, Taiyi
    Wang, Hanli
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 349 - 361