Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 6

Authors:

Wang, Haoran [1]

Xu, Di [2]

He, Dongliang [1]

Li, Fu [1]

Ji, Zhong [3]

Han, Jungong [4]

Ding, Errui [1]

Affiliations:

[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Video-Text Retrieval; High-level Semantics; Vision-language Understanding;

DOI:

10.1145/3503161.3548010

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to retrieve relevant videos given a text query, or relevant texts given a video query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics, and further decompose it into two levels, i.e. discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual modality, we parse the text into three parts: occurrence, action, and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
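
The idea of letting holistic and discrete semantics interact before cross-modal matching can be illustrated with a toy sketch. This is not the HiSE implementation: the function names (`fuse`, `rank`), the mean-aggregation step, the `alpha` weight, and the 3-dimensional embeddings are all hypothetical choices made for illustration, standing in for the graph reasoning and learned encoders the abstract mentions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse(holistic, discrete, alpha=0.5):
    """One hypothetical message-passing step: blend the holistic node
    (occurrence / caption) with the mean of its discrete neighbours
    (actions and entities)."""
    if not discrete:
        return list(holistic)
    mean = [sum(col) / len(discrete) for col in zip(*discrete)]
    return [alpha * h + (1 - alpha) * m for h, m in zip(holistic, mean)]


def rank(query, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)


# Made-up embeddings: one holistic vector plus discrete vectors per item.
text_query = fuse(holistic=[1.0, 0.0, 0.0],
                  discrete=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
videos = {
    "video_a": fuse([0.9, 0.1, 0.0], [[1.0, 0.8, 0.0]]),
    "video_b": fuse([0.0, 0.0, 1.0], [[0.0, 0.2, 1.0]]),
}
ranking = rank(text_query, videos)
print(ranking)
```

In this toy setup the text query is ranked against two candidate videos. A real system would replace the hand-written vectors with learned features and the single averaging step with the graph reasoning described in the paper.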

Citation

Pages: 4887-4898

Page count: 12

