CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval

Citations: 2
Authors
Gao, Yizhao [1 ]
Lu, Zhiwu [1 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; meta-learning; representation learning;
DOI
10.1145/3591106.3592238
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video-text retrieval has drawn great attention due to the prosperity of online video content. Most existing methods extract video embeddings by densely sampling abundant (generally dozens of) video clips, which incurs a tremendous computational cost. To reduce resource consumption, recent works propose to sparsely sample fewer clips from each raw video within a narrow time span. However, they still struggle to learn a reliable video representation from such locally sampled video clips, especially when tested in the cross-dataset setting. In this work, to overcome this problem, we sparsely and globally (with a wide time span) sample a handful of video clips from each raw video, which can be regarded as different samples of a pseudo video class (i.e., each raw video denotes a pseudo video class). From this viewpoint, we propose a novel Cross-Modal Meta-Transformer (CMMT) model that can be trained in a meta-learning paradigm. Concretely, in each training step, we conduct a cross-modal fine-grained classification task in which the text queries are classified against pseudo video class prototypes (each aggregating all sampled video clips of its pseudo video class). Since each classification task is defined over different/new videos (simulating the evaluation setting), this task-based meta-learning process enables our model to generalize well to new tasks and thus learn generalizable video/text representations. To further enhance the generalizability of our model, we introduce a token-aware adaptive Transformer module that dynamically updates our model (prototypes) for each individual text query. Extensive experiments on three benchmarks show that our model achieves new state-of-the-art results in cross-dataset video-text retrieval, demonstrating its stronger generalizability in video-text retrieval. Importantly, we find that our new meta-learning paradigm indeed brings improvements under both cross-dataset and in-dataset retrieval settings.
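The training episode described in the abstract (pseudo video classes built from globally sampled clips, prototypes aggregating the clips, and text queries classified against those prototypes) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the mean-pooling aggregation, cosine similarity, temperature value, and all function names here are assumptions for illustration only, and the token-aware adaptive Transformer module is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize embeddings so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def meta_episode_loss(clip_embs, text_embs, temperature=0.05):
    """One meta-learning episode: classify each text query against
    pseudo video class prototypes.

    clip_embs: (num_videos, clips_per_video, dim) clip embeddings;
               each raw video is one pseudo video class.
    text_embs: (num_videos, dim) text query embeddings; query i
               belongs to pseudo class i.
    """
    # Prototype per pseudo class = aggregate (here: mean) of its sampled clips.
    prototypes = l2_normalize(clip_embs.mean(axis=1))   # (V, D)
    queries = l2_normalize(text_embs)                   # (V, D)
    logits = queries @ prototypes.T / temperature       # (V, V)
    # Stable log-softmax over the pseudo classes, then cross-entropy
    # with labels 0..V-1 (each query matches its own video).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.arange(len(queries))
    return -log_probs[labels, labels].mean()

# Toy episode: 4 pseudo video classes, 3 globally sampled clips each, dim 16.
clips = rng.normal(size=(4, 3, 16))
texts = clips.mean(axis=1) + 0.1 * rng.normal(size=(4, 16))  # correlated queries
loss = meta_episode_loss(clips, texts)
print(float(loss))
```

Because each episode draws a fresh set of videos, every training step defines a new classification task over new pseudo classes, which is what lets the meta-learning paradigm mimic the cross-dataset evaluation setting.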
Pages: 76 / 84
Page count: 9
Related papers
50 items in total
  • [31] Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval
    Shukor, Mustafa
    Couairon, Guillaume
    Grechka, Asya
    Cord, Matthieu
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4566 - 4577
  • [32] Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval
    Feng, Zerun
    Zeng, Zhimin
    Guo, Caili
    Li, Zheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (03) : 1438 - 1453
  • [33] VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
    Huang, Siteng
    Gong, Biao
    Pan, Yulin
    Jiang, Jianwen
    Lv, Yiliang
    Li, Yuyuan
    Wang, Donglin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6565 - 6574
  • [34] Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval
    Han, Ning
    Chen, Jingjing
    Zhang, Hao
    Wang, Huanwen
    Chen, Hao
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (02)
  • [35] A Framework for Video-Text Retrieval with Noisy Supervision
    Vaseqi, Zahra
    Fan, Pengnan
    Clark, James
    Levine, Martin
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 373 - 383
  • [36] Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
    Chen, Yizhen
    Wang, Jie
    Lin, Lijian
    Qi, Zhongang
    Ma, Jin
    Shan, Ying
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 396 - 404
  • [37] Multi-event Video-Text Retrieval
    Zhang, Gengyuan
    Ren, Jisen
    Gu, Jindong
    Tresp, Volker
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22056 - 22066
  • [38] A NOVEL CONVOLUTIONAL ARCHITECTURE FOR VIDEO-TEXT RETRIEVAL
    Li, Zheng
    Guo, Caili
    Yang, Bo
    Feng, Zerun
    Zhang, Hao
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [39] Video-text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network
    Lv, Gang
    Sun, Yining
    Nian, Fudong
    MULTIMEDIA SYSTEMS, 2024, 30 (01)
  • [40] Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval
    Tang, Xu
    Wang, Yijing
    Ma, Jingjing
    Zhang, Xiangrong
    Liu, Fang
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61