Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 6
Authors
Wang, Haoran [1]
Xu, Di [2]
He, Dongliang [1]
Li, Fu [1]
Ji, Zhong [3]
Han, Jungong [4]
Ding, Errui [1]
Affiliations
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
Keywords
Video-Text Retrieval; High-level Semantics; Vision-language Understanding
DOI
10.1145/3503161.3548010
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to retrieve the relevant video (text) given a text (video) query. Existing methods typically align video and text using entirely heterogeneous visual and textual information, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic alignment model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. The occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
Pages: 4887-4898
Number of pages: 12
Related papers
50 in total
  • [31] SEMANTIC-PRESERVING METRIC LEARNING FOR VIDEO-TEXT RETRIEVAL
    Choo, Sungkwon
    Ha, Seong Jong
    Lee, Joonsoo
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2388 - 2392
  • [32] STACKED CONVOLUTIONAL DEEP ENCODING NETWORK FOR VIDEO-TEXT RETRIEVAL
    Zhao, Rui
    Zheng, Kecheng
    Zha, Zheng-jun
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [33] Improving Transformer with Dynamic Convolution and Shortcut for Video-Text Retrieval
    Liu, Zhi
    Cai, Jincen
    Zhang, Mengmeng
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2022, 16 (07): : 2407 - 2424
  • [34] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    MATHEMATICS, 2022, 10 (18)
  • [35] INTEGRATED MODALITIES AND MULTI-LEVEL GRANULARITY: TOWARDS A UNIFIED VIDEO-TEXT RETRIEVAL FRAMEWORK
    Liu, Liu
    Wang, Wenzhe
    Zhang, Zhijie
    Zhang, Mengdan
    Peng, Pai
    Sun, Xing
    2021 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2021,
  • [36] Unified Coarse-to-Fine Alignment for Video-Text Retrieval
    Wang, Ziyang
    Sung, Yi-Lin
    Cheng, Feng
    Bertasius, Gedas
    Bansal, Mohit
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2804 - 2815
  • [37] Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval
    Lin, Chengzhi
    Wu, Ancong
    Liang, Junwei
    Zhang, Jun
    Ge, Wenhang
    Zheng, Wei-Shi
    Shen, Chunhua
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [38] Using high-level semantic features in video retrieval
    Zheng, Wujie
    Li, Jianmin
    Si, Zhangzhang
    Lin, Fuzong
    Zhang, Bo
    IMAGE AND VIDEO RETRIEVAL, PROCEEDINGS, 2006, 4071 : 370 - 379
  • [39] High-level representation sketch for video event retrieval
    Yu ZHANG
    Xiaowu CHEN
    Liang LIN
    Changqun XIA
    Dongqing ZOU
    Science China (Information Sciences), 2016, 59 (07) : 158 - 172
  • [40] High-level representation sketch for video event retrieval
    Yu Zhang
    Xiaowu Chen
    Liang Lin
    Changqun Xia
    Dongqing Zou
    Science China Information Sciences, 2016, 59