CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval

Cited by: 2
|
Authors
Gao, Yizhao [1 ]
Lu, Zhiwu [1 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; meta-learning; representation learning;
DOI
10.1145/3591106.3592238
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video-text retrieval has drawn great attention due to the prosperity of online video content. Most existing methods extract video embeddings by densely sampling abundant (generally dozens of) video clips, which incurs tremendous computational cost. To reduce resource consumption, recent works propose to sparsely sample fewer clips from each raw video within a narrow time span. However, they still struggle to learn a reliable video representation from such locally sampled clips, especially when tested in the cross-dataset setting. In this work, to overcome this problem, we sparsely and globally (i.e., with a wide time span) sample a handful of video clips from each raw video, which can be regarded as different samples of a pseudo video class (i.e., each raw video denotes a pseudo video class). From this viewpoint, we propose a novel Cross-Modal Meta-Transformer (CMMT) model that can be trained in a meta-learning paradigm. Concretely, in each training step, we conduct a cross-modal fine-grained classification task in which text queries are classified against pseudo video class prototypes (each aggregating all sampled video clips of one pseudo video class). Since each classification task is defined over different/new videos (simulating the evaluation setting), this task-based meta-learning process enables our model to generalize well to new tasks and thus learn generalizable video/text representations. To further enhance the generalizability of our model, we introduce a token-aware adaptive Transformer module that dynamically updates our model (prototypes) for each individual text query. Extensive experiments on three benchmarks show that our model achieves new state-of-the-art results in cross-dataset video-text retrieval, demonstrating its greater generalizability in video-text retrieval. Importantly, we find that our new meta-learning paradigm indeed brings improvements under both cross-dataset and in-dataset retrieval settings.
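The episodic training step the abstract describes (aggregate globally sampled clips of each raw video into a pseudo-class prototype, then classify text queries against those prototypes) can be sketched in a few lines. This is a minimal illustrative sketch only: the mean aggregation, cosine scoring, and all function names below are assumptions for exposition, not the paper's actual learned aggregation or Transformer-based adaptation.

```python
import numpy as np

def build_prototypes(clip_embs, video_ids):
    """Aggregate sparsely/globally sampled clip embeddings into one
    prototype per pseudo video class (here a simple mean; the paper
    uses a learned aggregation)."""
    protos = []
    for vid in sorted(set(video_ids)):
        members = clip_embs[[i for i, v in enumerate(video_ids) if v == vid]]
        protos.append(members.mean(axis=0))
    return np.stack(protos)

def classify_text_queries(text_embs, prototypes):
    """Score each text query against every pseudo-class prototype via
    cosine similarity, yielding a fine-grained classification task."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return t @ p.T  # logits: (num_queries, num_pseudo_classes)

# Toy episode: 2 pseudo video classes, 3 clips each, 2 text queries.
rng = np.random.default_rng(0)
clips = rng.normal(size=(6, 8))
protos = build_prototypes(clips, [0, 0, 0, 1, 1, 1])
queries = protos + 0.01 * rng.normal(size=protos.shape)  # queries near their class
logits = classify_text_queries(queries, protos)
print(logits.argmax(axis=1))  # each query should match its own pseudo class
```

In training, each episode would draw a fresh set of videos to form new pseudo classes, so the cross-entropy over these logits is optimized on different/new tasks every step, which mimics the cross-dataset evaluation setting.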
Pages: 76-84
Page count: 9
Related Papers
50 records
  • [41] MULTI-SCALE INTERACTIVE TRANSFORMER FOR REMOTE SENSING CROSS-MODAL IMAGE-TEXT RETRIEVAL
    Wang, Yijing
    Ma, Jingjing
    Li, Mingteng
    Tang, Xu
    Han, Xiao
    Jiao, Licheng
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 839 - 842
  • [42] Deep learning for video-text retrieval: a review
    Zhu, Cunjuan
    Jia, Qi
    Chen, Wei
    Guo, Yanming
    Liu, Yu
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (01)
  • [43] Progressive Semantic Matching for Video-Text Retrieval
    Liu, Hongying
    Luo, Ruyi
    Shang, Fanhua
    Niu, Mantang
    Liu, Yuanyuan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5083 - 5091
  • [44] ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
    Cheng, Mengjun
    Sun, Yipeng
    Wang, Longchao
    Zhu, Xiongwei
    Yao, Kun
    Chen, Jie
    Song, Guoli
    Han, Junyu
    Liu, Jingtuo
    Ding, Errui
    Wang, Jingdong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5174 - 5183
  • [45] Cross-modal Image-Text Retrieval with Multitask Learning
    Luo, Junyu
    Shen, Ying
    Ao, Xiang
    Zhao, Zhou
    Yang, Min
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019, : 2309 - 2312
  • [46] CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL
    Chen, Mingliang
    Zhang, Weimin
    Ren, Yurui
    Li, Ge
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 386 - 390
  • [47] Cross-Modal Image-Text Retrieval with Semantic Consistency
    Chen, Hui
    Ding, Guiguang
    Lin, Zijin
    Zhao, Sicheng
    Han, Jungong
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1749 - 1757
  • [48] On Metric Learning for Audio-Text Cross-Modal Retrieval
    Mei, Xinhao
    Liu, Xubo
    Sun, Jianyuan
    Plumbley, Mark
    Wang, Wenwu
    INTERSPEECH 2022, 2022, : 4142 - 4146
  • [49] StacMR: Scene-Text Aware Cross-Modal Retrieval
    Mafla, Andres
    Rezende, Rafael S.
    Gomez, Lluis
    Larlus, Diane
    Karatzas, Dimosthenis
    2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, : 2219 - 2229
  • [50] Rethinking Benchmarks for Cross-modal Image-text Retrieval
    Chen, Weijing
    Yao, Linli
    Jin, Qin
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 1241 - 1251