Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 6

Authors:

Wang, Haoran [1]

Xu, Di [2]

He, Dongliang [1]

Li, Fu [1]

Ji, Zhong [3]

Han, Jungong [4]

Ding, Errui [1]

Affiliations:

[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Video-Text Retrieval; High-level Semantics; Vision-language Understanding;

DOI:

10.1145/3503161.3548010

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification code:

081203; 0835

Abstract:

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to retrieve relevant videos (texts) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
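The interaction between one holistic node and several discrete nodes can be pictured with a toy example. The sketch below is only an illustration under stated assumptions: the abstract does not specify which graph reasoning techniques HiSE uses, so a plain averaging update on a star-shaped graph is swapped in here. The function names (`mean_vector`, `reasoning_step`), the `mix` parameter, and the 2-d embeddings are all hypothetical.

```python
# Toy sketch only: the star-graph layout and the averaging update are
# assumptions for illustration, not the HiSE architecture from the paper.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reasoning_step(holistic, discrete, mix=0.5):
    """One round of message passing between holistic and discrete nodes.

    The holistic node (e.g. the "occurrence" of a sentence, or a generated
    caption) is linked to every discrete node (e.g. actions and entities).
    Each node moves toward the message it receives by the factor `mix`.
    """
    # The holistic node receives the mean of all discrete nodes.
    message = mean_vector(discrete)
    new_holistic = [(1 - mix) * h + mix * m for h, m in zip(holistic, message)]
    # Each discrete node receives the (old) holistic node.
    new_discrete = [
        [(1 - mix) * d + mix * h for d, h in zip(node, holistic)]
        for node in discrete
    ]
    return new_holistic, new_discrete


# Hypothetical 2-d embeddings for a caption such as "a man plays a guitar".
occurrence = [1.0, 0.0]        # holistic node
action_and_entity = [
    [0.0, 1.0],                # discrete node: an action
    [0.0, 3.0],                # discrete node: an entity
]

new_occurrence, new_discrete = reasoning_step(occurrence, action_and_entity)
print(new_occurrence)
print(new_discrete)
```

After one step the holistic vector has absorbed part of the discrete information and vice versa, which is the kind of two-way exchange the abstract describes. A real model would learn the edge weights rather than use a fixed `mix`.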


Pages: 4887-4898

Page count: 12

50 records in total

[1] Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval
Fang, Han
Yang, Zhifei
Zang, Xianghao
Ban, Chao
He, Zhongjiang
Sun, Hao
Zhou, Lanxiang
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3847 - 3856
[2] Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval
Shi, Yaya
Liu, Haowei
Xu, Haiyang
Ma, Zongyang
Ye, Qinghao
Hu, Anwen
Yan, Ming
Zhang, Ji
Huang, Fei
Yuan, Chunfeng
Li, Bing
Hu, Weiming
Zha, Zheng-Jun
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4460 - 4470
[3] A NOVEL CONVOLUTIONAL ARCHITECTURE FOR VIDEO-TEXT RETRIEVAL
Li, Zheng
Guo, Caili
Yang, Bo
Feng, Zerun
Zhang, Hao
2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
[4] Deep learning for video-text retrieval: a review
Zhu, Cunjuan
Jia, Qi
Chen, Wei
Guo, Yanming
Liu, Yu
INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (01)
[5] Progressive Semantic Matching for Video-Text Retrieval
Liu, Hongying
Luo, Ruyi
Shang, Fanhua
Niu, Mantang
Liu, Yuanyuan
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5083 - 5091
[6] Multi-event Video-Text Retrieval
Zhang, Gengyuan
Ren, Jisen
Gu, Jindong
Tresp, Volker
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22056 - 22066
[7] A Framework for Video-Text Retrieval with Noisy Supervision
Vaseqi, Zahra
Fan, Pengnan
Clark, James
Levine, Martin
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 373 - 383
[8] Deep learning for video-text retrieval: a review
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
International Journal of Multimedia Information Retrieval, 2023, 12
[9] Visual Consensus Modeling for Video-Text Retrieval
Cao, Shuqiang
Wang, Bairui
Zhang, Wei
Ma, Lin
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 167 - 175
[10] MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval
Ge, Yuying
Ge, Yixiao
Liu, Xihui
Wang, Jinpeng
Wu, Jianping
Shan, Ying
Qie, Xiaohu
Luo, Ping
COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 691 - 708
