CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval

Cited by: 2
Authors
Gao, Yizhao [1 ]
Lu, Zhiwu [1 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; meta-learning; representation learning;
DOI
10.1145/3591106.3592238
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video-text retrieval has drawn great attention due to the prosperity of online video content. Most existing methods extract video embeddings by densely sampling abundant (generally dozens of) video clips, which incurs tremendous computational cost. To reduce resource consumption, recent works propose to sparsely sample fewer clips from each raw video within a narrow time span. However, they still struggle to learn a reliable video representation from such locally sampled clips, especially when tested in the cross-dataset setting. In this work, to overcome this problem, we sparsely and globally (i.e., with a wide time span) sample a handful of video clips from each raw video; these clips can be regarded as different samples of a pseudo video class (i.e., each raw video denotes a pseudo video class). From this viewpoint, we propose a novel Cross-Modal Meta-Transformer (CMMT) model that can be trained in a meta-learning paradigm. Concretely, in each training step we conduct a cross-modal fine-grained classification task in which text queries are classified against pseudo video class prototypes (each aggregating all sampled video clips of one pseudo video class). Since each classification task is defined over different/new videos (simulating the evaluation setting), this task-based meta-learning process enables our model to generalize well to new tasks and thus learn generalizable video/text representations. To further enhance generalizability, we introduce a token-aware adaptive Transformer module that dynamically updates the model (prototypes) for each individual text query. Extensive experiments on three benchmarks show that our model achieves new state-of-the-art results in cross-dataset video-text retrieval, demonstrating stronger generalizability in video-text retrieval. Importantly, we find that our new meta-learning paradigm brings improvements under both cross-dataset and in-dataset retrieval settings.
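The episodic training step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: it assumes clip and text embeddings have already been extracted by some encoder, averages each pseudo video class's sampled clips into a prototype, and scores each text query against all prototypes with a cosine-similarity cross-entropy loss. All function names, the temperature value, and the toy dimensions are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prototype_logits(clip_emb, text_emb, temperature=0.05):
    """clip_emb: (n_classes, n_clips, d) embeddings of sparsely, globally sampled clips.
    text_emb: (n_classes, d) one text query per pseudo video class.
    Returns (n_classes, n_classes) scaled cosine-similarity logits."""
    # Aggregate all sampled clips of each pseudo video class into one prototype.
    prototypes = l2_normalize(clip_emb.mean(axis=1))
    queries = l2_normalize(text_emb)
    return (queries @ prototypes.T) / temperature

def episode_loss(clip_emb, text_emb):
    """Cross-entropy over pseudo video classes: text query i should match prototype i."""
    logits = prototype_logits(clip_emb, text_emb)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy episode: 4 pseudo video classes, 3 clips each, embedding dim 8.
rng = np.random.default_rng(0)
clips = rng.standard_normal((4, 3, 8))
texts = clips.mean(axis=1) + 0.1 * rng.standard_normal((4, 8))  # queries near prototypes
print(episode_loss(clips, texts))
```

Because each episode is built from a fresh batch of videos, every training step defines a new classification task over unseen pseudo classes, which is what lets the meta-learned representations transfer across datasets.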
Pages: 76-84 (9 pages)