Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 6

Authors:

Wang, Haoran [1]

Xu, Di [2]

He, Dongliang [1]

Li, Fu [1]

Ji, Zhong [3]

Han, Jungong [4]

Ding, Errui [1]

Affiliations:

[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Video-Text Retrieval; High-level Semantics; Vision-language Understanding

DOI:

10.1145/3503161.3548010

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

Video-text retrieval (VTR) is an attractive yet challenging task in multi-modal understanding, which aims to retrieve the relevant video (text) given a text (video) query. Existing methods typically rely on completely heterogeneous visual-textual information to align video and text, and lack awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic alignment model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels: discrete semantics and holistic semantics. Specifically, for the visual branch, we use an off-the-shelf semantic entity predictor to generate discrete high-level semantics, and in parallel employ a trained video captioning model to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. The occurrence corresponds to the holistic high-level semantics, while action and entity represent the discrete ones. Then, different graph reasoning techniques are used to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
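To make the holistic/discrete interaction more concrete, here is a minimal sketch in plain Python. It is not the HiSE implementation: the abstract does not specify the graph reasoning techniques, so the sketch substitutes one round of mean aggregation on a star graph (one holistic node linked to every discrete node) and then ranks videos by cosine similarity. The function names, the mixing weight `alpha`, and the toy vectors are all assumptions made for illustration.

```python
from math import sqrt


def mean_aggregate(holistic, discrete, alpha=0.5):
    """One round of message passing on a star graph (illustrative stand-in).

    holistic: feature vector of the holistic node (e.g. the "occurrence").
    discrete: list of feature vectors for discrete nodes (actions, entities).
    alpha:    hypothetical mixing weight between a node and its neighbours.
    """
    dim = len(holistic)
    # The holistic node receives the mean of all discrete node features.
    mean_discrete = [sum(v[i] for v in discrete) / len(discrete) for i in range(dim)]
    new_holistic = [(1 - alpha) * h + alpha * m for h, m in zip(holistic, mean_discrete)]
    # Each discrete node receives the (pre-update) holistic feature.
    new_discrete = [
        [(1 - alpha) * x + alpha * h for x, h in zip(v, holistic)] for v in discrete
    ]
    return new_holistic, new_discrete


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def rank_videos(text_vec, video_vecs):
    """Return video indices sorted from most to least similar to the text."""
    return sorted(
        range(len(video_vecs)),
        key=lambda i: cosine(text_vec, video_vecs[i]),
        reverse=True,
    )


if __name__ == "__main__":
    occurrence = [1.0, 0.0]               # toy holistic feature
    parts = [[0.0, 1.0], [0.0, 1.0]]      # toy action and entity features
    fused, _ = mean_aggregate(occurrence, parts)
    print(fused)
    print(rank_videos(fused, [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]))
```

In a real system the vectors would come from learned encoders and the aggregation would use learned weights. The sketch only shows the data flow: holistic and discrete semantics exchange information, and the enriched representation drives the ranking.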

Pages: 4887-4898

Page count: 12

50 items in total

[11] Joint embeddings with multimodal cues for video-text retrieval
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
International Journal of Multimedia Information Retrieval, 2019, 8 : 3 - 18
[12] An Efficient Multimodal Aggregation Network for Video-Text Retrieval
Liu, Zhi
Zhao, Fangyuan
Zhang, Mengmeng
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (10) : 1825 - 1828
[13] Joint embeddings with multimodal cues for video-text retrieval
Mithun, Niluthpol C.
Li, Juncheng
Metze, Florian
Roy-Chowdhury, Amit K.
INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 3 - 18
[14] Exploiting Visual Semantic Reasoning for Video-Text Retrieval
Feng, Zerun
Zeng, Zhimin
Guo, Caili
Li, Zheng
PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1005 - 1011
[15] Animating Images to Transfer CLIP for Video-Text Retrieval
Liu, Yu
Chen, Huai
Huang, Lianghua
Chen, Di
Wang, Bin
Pan, Pan
Wang, Lisheng
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 1906 - 1911
[16] VTC: Improving Video-Text Retrieval with User Comments
Hanu, Laura
Thewlis, James
Asano, Yuki M.
Rupprecht, Christian
COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 616 - 633
[17] Bridging Video-text Retrieval with Multiple Choice Questions
Ge, Yuying
Ge, Yixiao
Liu, Xihui
Li, Dian
Shan, Ying
Qie, Xiaohu
Luo, Ping
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16146 - 16155
[18] Survey on Video-Text Cross-Modal Retrieval
Chen, Lei
Xi, Yimeng
Liu, Libo
Computer Engineering and Applications, 2024, 60 (04) : 1 - 20
[19] HANet: Hierarchical Alignment Networks for Video-Text Retrieval
Wu, Peng
He, Xiangteng
Tang, Mingqian
Lv, Yiliang
Liu, Jing
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3518 - 3527
[20] Adaptive Token Excitation with Negative Selection for Video-Text Retrieval
Yu, Juntao
Ni, Zhangkai
Su, Taiyi
Wang, Hanli
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 349 - 361