Boosting Video-Text Retrieval with Explicit High-Level Semantics

Times Cited: 6
Authors
Wang, Haoran [1 ]
Xu, Di [2 ]
He, Dongliang [1 ]
Li, Fu [1 ]
Ji, Zhong [3 ]
Han, Jungong [4 ]
Ding, Errui [1 ]
Affiliations
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
Keywords
Video-Text Retrieval; High-level Semantics; Vision-language Understanding
DOI
10.1145/3503161.3548010
CLC Classification
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for the relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic alignment model named HiSE for VTR, which improves cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD and DiDeMo.
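The abstract describes the HiSE pipeline only at a high level; the following is a minimal, illustrative sketch of that description in PyTorch, not the authors' implementation. All class and argument names (GraphReasoning, HiSESketch, entity_ids, caption_ids, occurrence_ids, action_ids) and the dimensions are assumptions, and the off-the-shelf entity predictor, video captioning model and text parser the paper relies on are represented only by the token ids they would output.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoning(nn.Module):
    # Generic self-attention-style message passing over semantic nodes,
    # standing in for the paper's graph reasoning modules (assumption).
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes):                          # nodes: (N, dim)
        attn = F.softmax(nodes @ nodes.t() / nodes.size(-1) ** 0.5, dim=-1)
        return nodes + self.proj(attn @ nodes)         # residual message passing

class HiSESketch(nn.Module):
    def __init__(self, dim=512, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)     # shared semantic-token embedding (assumption)
        self.video_graph = GraphReasoning(dim)
        self.text_graph = GraphReasoning(dim)

    def encode_video(self, entity_ids, caption_ids):
        # entity_ids: discrete high-level semantics from an off-the-shelf entity predictor
        # caption_ids: holistic high-level semantics from a trained video captioning model
        nodes = self.embed(torch.cat([entity_ids, caption_ids]))
        return F.normalize(self.video_graph(nodes).mean(dim=0), dim=0)

    def encode_text(self, occurrence_ids, action_ids, entity_ids):
        # occurrence ~ holistic semantics; action and entity ~ discrete semantics
        nodes = self.embed(torch.cat([occurrence_ids, action_ids, entity_ids]))
        return F.normalize(self.text_graph(nodes).mean(dim=0), dim=0)

# Toy usage with random token ids; real ids would come from the upstream predictors/parser.
model = HiSESketch()
v = model.encode_video(torch.randint(0, 1000, (5,)), torch.randint(0, 1000, (12,)))
t = model.encode_text(torch.randint(0, 1000, (8,)), torch.randint(0, 1000, (3,)), torch.randint(0, 1000, (4,)))
similarity = torch.dot(v, t)                           # cosine similarity used to rank videos against the text query

The sketch pools all semantic nodes into a single vector per modality for brevity; the paper instead keeps holistic and discrete semantics distinct and applies different graph reasoning techniques before matching.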
Pages: 4887-4898
Number of pages: 12
Related Papers
50 records in total
  • [21] Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
    Hao, Xiaoshuai
    Zhang, Wanqian
    Wu, Dayan
    Zhu, Fei
    Li, Bo
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18962 - 18972
  • [22] Complementarity-Aware Space Learning for Video-Text Retrieval
    Zhu, Jinkuan
    Zeng, Pengpeng
    Gao, Lianli
    Li, Gongfu
    Liao, Dongliang
    Song, Jingkuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4362 - 4374
  • [23] Uncertainty-Aware with Negative Samples for Video-Text Retrieval
    Song, Weitao
    Chen, Weiran
    Xu, Jialiang
    Ji, Yi
    Li, Ying
    Liu, Chunping
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 318 - 332
  • [24] Coarse-to-fine dual-level attention for video-text cross modal retrieval
    Jin, Ming
    Zhang, Huaxiang
    Zhu, Lei
    Sun, Jiande
    Liu, Li
    KNOWLEDGE-BASED SYSTEMS, 2022, 242
  • [25] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
    Ma, Wentao
    Chen, Qingchao
    Zhou, Tongqing
    Zhao, Shan
    Cai, Zhiping
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497
  • [26] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
    Liu, Song
    Fan, Haoqi
    Qian, Shengsheng
    Chen, Yiru
    Ding, Wenkui
    Wang, Zhongyuan
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11895 - 11905
  • [27] Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval
    Lai, Huakai
    Yang, Wenfei
    Zhang, Tianzhu
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 12019 - 12031
  • [28] Video-Text Pre-training with Learned Regions for Retrieval
    Yan, Rui
    Shou, Mike Zheng
    Ge, Yixiao
    Wang, Jinpeng
    Lin, Xudong
    Cai, Guanyu
    Tang, Jinhui
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3100 - 3108
  • [29] Robust Video-Text Retrieval Via Noisy Pair Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8632 - 8645
  • [30] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58