CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval

Cited by: 2

Authors:

Gao, Yizhao [1]

Lu, Zhiwu [1]

Affiliations:

[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Video-text retrieval; meta-learning; representation learning;

DOI:

10.1145/3591106.3592238

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video-text retrieval has drawn great attention due to the proliferation of online video content. Most existing methods extract video embeddings by densely sampling many (generally dozens of) video clips, which incurs a tremendous computational cost. To reduce resource consumption, recent works propose to sparsely sample fewer clips from each raw video within a narrow time span. However, they still struggle to learn a reliable video representation from such locally sampled clips, especially when tested in the cross-dataset setting. In this work, to overcome this problem, we sparsely and globally (with a wide time span) sample a handful of video clips from each raw video, which can be regarded as different samples of a pseudo video class (i.e., each raw video denotes a pseudo video class). From this viewpoint, we propose a novel Cross-Modal Meta-Transformer (CMMT) model that can be trained in a meta-learning paradigm. Concretely, in each training step, we conduct a cross-modal fine-grained classification task in which the text queries are classified against pseudo video class prototypes (each aggregating all sampled video clips of its pseudo video class). Since each classification task is defined with different/new videos (simulating the evaluation setting), this task-based meta-learning process enables our model to generalize well to new tasks and thus learn generalizable video/text representations. To further enhance the generalizability of our model, we introduce a token-aware adaptive Transformer module to dynamically update our model (prototypes) for each individual text query. Extensive experiments on three benchmarks show that our model achieves new state-of-the-art results in cross-dataset video-text retrieval, demonstrating that it generalizes better in video-text retrieval. Importantly, we find that our new meta-learning paradigm indeed brings improvements under both cross-dataset and in-dataset retrieval settings.
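
The training step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sample_clip_starts`, `episode_loss`), mean pooling as the prototype aggregation, cosine similarity, and the temperature value are all assumptions made for illustration, the embeddings are synthetic rather than produced by real encoders, and the token-aware adaptive Transformer module is omitted.

```python
import numpy as np


def sample_clip_starts(num_frames, num_clips, clip_len, rng):
    """Sparse, global sampling (assumed scheme): split the video into
    equal segments and draw one clip start frame from each segment."""
    edges = np.linspace(0, num_frames - clip_len, num_clips + 1)
    starts = [
        int(rng.integers(int(lo), max(int(lo) + 1, int(hi))))
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return np.array(starts)


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def episode_loss(clip_emb, text_emb, temperature=0.05):
    """One classification episode.

    clip_emb: (N, K, D) embeddings of K clips for each of N videos.
    text_emb: (N, D) text query embeddings; query i describes video i.
    Each video is a pseudo class whose prototype is the mean of its
    clip embeddings (mean pooling is an assumption of this sketch).
    """
    prototypes = l2_normalize(clip_emb.mean(axis=1))   # (N, D)
    queries = l2_normalize(text_emb)                   # (N, D)
    logits = queries @ prototypes.T / temperature      # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.arange(len(queries))
    loss = -log_probs[labels, labels].mean()           # cross-entropy
    preds = logits.argmax(axis=1)                      # predicted pseudo class
    return loss, preds


rng = np.random.default_rng(0)

# Clip start frames for a hypothetical 300-frame video, 4 clips of 16 frames.
starts = sample_clip_starts(num_frames=300, num_clips=4, clip_len=16, rng=rng)

# Synthetic episode: 8 videos, 4 clips each, 32-dim embeddings.
N, K, D = 8, 4, 32
centers = rng.normal(size=(N, D))
clip_emb = centers[:, None, :] + 0.1 * rng.normal(size=(N, K, D))
text_emb = centers + 0.1 * rng.normal(size=(N, D))

loss, preds = episode_loss(clip_emb, text_emb)
print("clip starts:", starts)
print("episode loss:", float(loss))
print("predicted video per query:", preds)
```

In a meta-learning loop of this kind, each training step would draw a fresh batch of videos, so every episode poses a new classification task over previously unseen pseudo classes.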

Pages: 76-84

Page count: 9
