Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

被引:0
|
作者
Chen, Yizhen [1 ]
Wang, Jie [1 ]
Lin, Lijian [2 ]
Qi, Zhongang [2 ]
Ma, Jin [1 ]
Shan, Ying [2 ]
机构
[1] Tencent PCG, IPS Search, Beijing, Peoples R China
[2] Tencent PCG, ARC Lab, Beijing, Peoples R China
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Vision-language alignment learning for video-text retrieval arouses a lot of attention in recent years. Most of the existing methods either transfer the knowledge of image-text pretraining model to video-text retrieval task without fully exploring the multi-modal information of videos, or simply fuse multi-modal features in a brute force manner without explicit guidance. In this paper, we integrate multi-modal information in an explicit manner by tagging, and use the tags as the anchors for better video-text alignment. Various pretrained experts are utilized for extracting the information of multiple modalities, including object, person, motion, audio, etc. To take full advantage of these information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tags information. Furthermore, to strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks, Video Text Matching (VTM) and Masked Language Modeling (MLM). Extensive experimental results demonstrate that the TABLE model is capable of achieving State-Of-The-Art (SOTA) performance on various video-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC and DiDeMo.
引用
收藏
页码:396 / 404
页数:9
相关论文
共 50 条
  • [1] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    [J]. MATHEMATICS, 2022, 10 (18)
  • [2] Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment
    Che, Zhanbin
    Guo, Huaili
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (02) : 303 - 311
  • [3] Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval
    Chen, Lei
    Deng, Zhen
    Liu, Libo
    Yin, Shibai
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6559 - 6575
  • [4] Survey on Video-Text Cross-Modal Retrieval
    Chen, Lei
    Xi, Yimeng
    Liu, Libo
    [J]. Computer Engineering and Applications, 2024, 60 (04) : 1 - 20
  • [5] HANet: Hierarchical Alignment Networks for Video-Text Retrieval
    Wu, Peng
    He, Xiangteng
    Tang, Mingqian
    Lv, Yiliang
    Liu, Jing
    [J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3518 - 3527
  • [6] Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment
    Wang, Wenzhe
    Zhang, Mengdan
    Chen, Runnan
    Cai, Guanyu
    Zhou, Penghao
    Peng, Pai
    Guo, Xiaowei
    Wu, Jian
    Sun, Xing
    [J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 1113 - 1121
  • [7] Multi-event Video-Text Retrieval
    Zhang, Gengyuan
    Ren, Jisen
    Gu, Jindong
    Tresp, Volker
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22056 - 22066
  • [8] Video-text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network
    Lv, Gang
    Sun, Yining
    Nian, Fudong
    [J]. MULTIMEDIA SYSTEMS, 2024, 30 (01)
  • [9] Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
    Hao, Xiaoshuai
    Zhang, Wanqian
    Wu, Dayan
    Zhu, Fei
    Li, Bo
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18962 - 18972
  • [10] Unified Coarse-to-Fine Alignment for Video-Text Retrieval
    Wang, Ziyang
    Sung, Yi-Lin
    Cheng, Feng
    Bertasius, Gedas
    Bansal, Mohit
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2804 - 2815