INTEGRATED MODALITIES AND MULTI-LEVEL GRANULARITY: TOWARDS A UNIFIED VIDEO-TEXT RETRIEVAL FRAMEWORK

Cited by: 0

Authors:

Liu, Liu ^{[1]}

Wang, Wenzhe ^{[2]}

Zhang, Zhijie ^{[1]}

Zhang, Mengdan ^{[3]}

Peng, Pai ^{[3]}

Sun, Xing ^{[3]}

Affiliations:

[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China

[2] Zhejiang Univ, Hangzhou, Peoples R China

[3] Tencent, Youtu Lab, Shenzhen, Peoples R China

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW) | 2021

Keywords:

Video-text retrieval; multi-modal transformer; hierarchical alignment

DOI:

10.1109/ICMEW53276.2021.9455971

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. Recent research addresses different aspects of this task, such as exploiting multi-modal video cues, hierarchical reasoning, and learning pre-trained models. The implementations of these approaches vary widely, which makes further research difficult. Therefore, in this paper, we provide a unified video-text retrieval framework with the following features: 1) a modular design for easy modification of different structures of deep learning models; 2) training and test pipelines for state-of-the-art (SOTA) models that leverage hierarchy cues and interactions between different levels of granularity and different video modalities; 3) support for various benchmark datasets; 4) demos, together with thorough testing and documentation. We hope our unified framework will prove useful and efficient for further research.

Pages: 2

50 items in total

[1] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
Nian, Fudong
Ding, Ling
Hu, Yuxia
Gu, Yanhong
[J]. MATHEMATICS, 2022, 10 (18)
[2] A Framework for Video-Text Retrieval with Noisy Supervision
Vaseqi, Zahra
Fan, Pengnan
Clark, James
Levine, Martin
[J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 373 - 383
[3] Multi-event Video-Text Retrieval
Zhang, Gengyuan
Ren, Jisen
Gu, Jindong
Tresp, Volker
[J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22056 - 22066
[4] Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Wang, Ziyang
Sung, Yi-Lin
Cheng, Feng
Bertasius, Gedas
Bansal, Mohit
[J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2804 - 2815
[5] Boosting Video-Text Retrieval with Explicit High-Level Semantics
Wang, Haoran
Xu, Di
He, Dongliang
Li, Fu
Ji, Zhong
Han, Jungong
Ding, Errui
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4887 - 4898
[6] CLIP Based Multi-Event Representation Generation for Video-Text Retrieval
Tu R.
Mao X.
Kong W.
Cai C.
Zhao W.
Wang H.
Huang H.
[J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (09) : 2169 - 2179
[7] Technological innovation systems and the multi-level perspective: Towards an integrated framework
Markard, Jochen
Truffer, Bernhard
[J]. RESEARCH POLICY, 2008, 37 (04) : 596 - 615
[8] Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
Chen, Yizhen
Wang, Jie
Lin, Lijian
Qi, Zhongang
Ma, Jin
Shan, Ying
[J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 396 - 404
[9] A multi-level framework for video shot structuring
Zhai, Y
Shah, M
[J]. IMAGE ANALYSIS AND RECOGNITION, 2005, 3656 : 167 - 173
[10] Coarse-to-fine dual-level attention for video-text cross modal retrieval
Jin, Ming
Zhang, Huaxiang
Zhu, Lei
Sun, Jiande
Liu, Li
[J]. KNOWLEDGE-BASED SYSTEMS, 2022, 242
