INTEGRATED MODALITIES AND MULTI-LEVEL GRANULARITY: TOWARDS A UNIFIED VIDEO-TEXT RETRIEVAL FRAMEWORK

Citations: 0
Authors
Liu, Liu [1]
Wang, Wenzhe [2]
Zhang, Zhijie [1]
Zhang, Mengdan [3]
Peng, Pai [3]
Sun, Xing [3]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Tencent, Youtu Lab, Shenzhen, Peoples R China
Keywords
Video-text retrieval; multi-modal transformer; hierarchical alignment
DOI
10.1109/ICMEW53276.2021.9455971
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. Recent research addresses different aspects of this task, such as exploiting multi-modal video cues, hierarchical reasoning, and learning pre-trained models. The implementations of these approaches vary widely, which makes further research difficult. Therefore, in this paper, we provide a unified video-text retrieval framework with the following features: 1) a modular design that makes it easy to modify the different structures of deep learning models; 2) training and test pipelines for state-of-the-art (SOTA) models that leverage hierarchy cues and interactions between different levels of granularity and different video modalities; 3) support for various benchmark datasets; 4) demos, together with thorough testing and documentation. We hope our unified framework will prove useful and efficient for further research.
Pages: 2