Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Cited by: 7
Authors
Shukor, Mustafa [1 ]
Couairon, Guillaume [1 ,2 ]
Grechka, Asya [1 ,3 ]
Cord, Matthieu [1 ,4 ]
Institutions
[1] Sorbonne Univ, Paris, France
[2] Meta AI, New York, NY USA
[3] Meero, Paris, France
[4] Valeoai, Paris, France
Keywords
DOI
10.1109/CVPRW56347.2022.00503
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline Classification Code
081202
Abstract
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, which allow efficient retrieval in large-scale databases, leaving aside cross-attention between modalities, which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval), that exploits the interaction between modalities in a novel regularization scheme, while using only unimodal encoders at test time for efficient retrieval. We also capture the intra-dependencies between recipe entities with a dedicated recipe encoder, and propose new variants of triplet losses with dynamic margins that adapt to the difficulty of the task. Finally, we leverage the power of recent Vision and Language Pretraining (VLP) models such as CLIP for the image encoder. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6 R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets respectively. The code is available at: https://github.com/mshukor/TFood.
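The abstract mentions triplet-loss variants whose margin adapts to task difficulty. The exact formulation used in T-Food is not given here, so the following is only an illustrative sketch under one common assumption: the margin is enlarged for harder negatives, i.e. negatives whose similarity to the anchor is high. Function and parameter names (`dynamic_margin_triplet_loss`, `base_margin`, `scale`) are hypothetical.

```python
import numpy as np

def dynamic_margin_triplet_loss(anchor, positive, negative,
                                base_margin=0.3, scale=0.2):
    """Triplet loss with a difficulty-adaptive margin (illustrative sketch).

    The margin grows with the anchor-negative cosine similarity, so harder
    negatives are pushed further away. This is one plausible reading of a
    "dynamic margin", not the paper's exact scheme.
    """
    def normalize(x):
        # L2-normalize embeddings so dot products are cosine similarities
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    sim_pos = (a * p).sum(axis=-1)   # anchor-positive similarity
    sim_neg = (a * n).sum(axis=-1)   # anchor-negative similarity
    # Dynamic margin: larger when the negative is more similar (harder)
    margin = base_margin + scale * np.clip(sim_neg, 0.0, None)
    # Standard hinge on the similarity gap, averaged over the batch
    return np.maximum(sim_neg - sim_pos + margin, 0.0).mean()
```

With a well-separated triplet (positive identical to the anchor, negative pointing the opposite way) the hinge term is negative and the loss is zero; the loss only activates when the negative encroaches on the positive by less than the margin.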
Pages: 4566 - 4577
Page count: 12
Related Papers
50 items in total
  • [21] Adversarial Cross-Modal Retrieval
    Wang, Bokun
    Yang, Yang
    Xu, Xing
    Hanjalic, Alan
    Shen, Heng Tao
    [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 154 - 162
  • [22] CHEF: Cross-Modal Hierarchical Embeddings for Food Domain Retrieval
    Pham, Hai X.
    Guerrero, Ricardo
    Li, Jiatong
    Pavlovic, Vladimir
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2423 - 2430
  • [23] HCMSL: Hybrid Cross-modal Similarity Learning for Cross-modal Retrieval
    Zhang, Chengyuan
    Song, Jiayu
    Zhu, Xiaofeng
    Zhu, Lei
    Zhang, Shichao
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 17 (01)
  • [24] Semantic Consistency Cross-Modal Retrieval With Semi-Supervised Graph Regularization
    Xu, Gongwen
    Li, Xiaomei
    Zhang, Zhijun
    [J]. IEEE ACCESS, 2020, 8 : 14278 - 14288
  • [25] Multi-Kernel Supervised Hashing with Graph Regularization for Cross-Modal Retrieval
    Zhu, Ming
    Miao, Huanghui
    Tang, Jun
    [J]. 2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 2717 - 2722
  • [26] A cross-modal crowd counting method combining CNN and cross-modal transformer
    Zhang, Shihui
    Wang, Wei
    Zhao, Weibo
    Wang, Lei
    Li, Qunpeng
    [J]. IMAGE AND VISION COMPUTING, 2023, 129
  • [27] VLDeformer: Vision-Language Decomposed Transformer for fast cross-modal retrieval
    Zhang, Lisai
    Wu, Hongfa
    Chen, Qingcai
    Deng, Yimeng
    Siebert, Joanna
    Li, Zhonghua
    Han, Yunpeng
    Kong, Dejiang
    Cao, Zhao
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 252
  • [28] Distillation-Based Hashing Transformer for Cross-Modal Vessel Image Retrieval
    Guo, Jiaen
    Guan, Xin
    Liu, Ying
    Lu, Yu
    [J]. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2023, 20
  • [29] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval
    Gao, Yizhao
    Lu, Zhiwu
    [J]. PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 76 - 84
  • [30] Multimodal Discriminative Binary Embedding for Large-Scale Cross-Modal Retrieval
    Wang, Di
    Gao, Xinbo
    Wang, Xiumei
    He, Lihuo
    Yuan, Bo
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2016, 25 (10) : 4540 - 4554