Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Cited by: 7
Authors
Shukor, Mustafa [1 ]
Couairon, Guillaume [1 ,2 ]
Grechka, Asya [1 ,3 ]
Cord, Matthieu [1 ,4 ]
Institutions
[1] Sorbonne Univ, Paris, France
[2] Meta AI, New York, NY USA
[3] Meero, Paris, France
[4] Valeoai, Paris, France
Keywords
DOI
10.1109/CVPRW56347.2022.00503
Chinese Library Classification (CLC)
TP301 [Theory, Methods];
Discipline Classification Code
081202
Abstract
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, which allow efficient retrieval in large-scale databases, leaving aside cross-attention between modalities, which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval), that exploits the interaction between modalities in a novel regularization scheme, while using only unimodal encoders at test time for efficient retrieval. We also capture the intra-dependencies between recipe entities with a dedicated recipe encoder, and propose new variants of triplet losses with dynamic margins that adapt to the difficulty of the task. Finally, we leverage the power of recent Vision and Language Pretraining (VLP) models such as CLIP for the image encoder. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6 R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets respectively. The code is available at: https://github.com/mshukor/TFood.
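The abstract mentions triplet-loss variants whose margin adapts to task difficulty. The exact formulation used in T-Food is not given here, so the following is only an illustrative sketch under one common assumption: the margin is enlarged for harder negatives, i.e. negatives whose similarity to the anchor is high. Function and parameter names (`dynamic_margin_triplet_loss`, `base_margin`, `scale`) are hypothetical.

```python
import numpy as np

def dynamic_margin_triplet_loss(anchor, positive, negative,
                                base_margin=0.3, scale=0.2):
    """Triplet loss with a difficulty-adaptive margin (illustrative sketch).

    The margin grows with the anchor-negative cosine similarity, so harder
    negatives are pushed further away. This is one plausible reading of a
    "dynamic margin", not the paper's exact scheme.
    """
    def normalize(x):
        # L2-normalize embeddings so dot products are cosine similarities
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    sim_pos = (a * p).sum(axis=-1)   # anchor-positive similarity
    sim_neg = (a * n).sum(axis=-1)   # anchor-negative similarity
    # Dynamic margin: larger when the negative is more similar (harder)
    margin = base_margin + scale * np.clip(sim_neg, 0.0, None)
    # Standard hinge on the similarity gap, averaged over the batch
    return np.maximum(sim_neg - sim_pos + margin, 0.0).mean()
```

With a well-separated triplet (positive identical to the anchor, negative pointing the opposite way) the hinge term is negative and the loss is zero; the loss only activates when the negative encroaches on the positive by less than the margin.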
Pages: 4566 - 4577
Page count: 12
Related Papers
50 items in total
  • [21] Adversarial Cross-Modal Retrieval
    Wang, Bokun
    Yang, Yang
    Xu, Xing
    Hanjalic, Alan
    Shen, Heng Tao
    [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 154 - 162
  • [22] CHEF: Cross-Modal Hierarchical Embeddings for Food Domain Retrieval
    Pham, Hai X.
    Guerrero, Ricardo
    Li, Jiatong
    Pavlovic, Vladimir
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2423 - 2430
  • [23] HCMSL: Hybrid Cross-modal Similarity Learning for Cross-modal Retrieval
    Zhang, Chengyuan
    Song, Jiayu
    Zhu, Xiaofeng
    Zhu, Lei
    Zhang, Shichao
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 17 (01)
  • [24] Semantic Consistency Cross-Modal Retrieval With Semi-Supervised Graph Regularization
    Xu, Gongwen
    Li, Xiaomei
    Zhang, Zhijun
    [J]. IEEE ACCESS, 2020, 8 : 14278 - 14288
  • [25] Multi-Kernel Supervised Hashing with Graph Regularization for Cross-Modal Retrieval
    Zhu, Ming
    Miao, Huanghui
    Tang, Jun
    [J]. 2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018, : 2717 - 2722
  • [26] A cross-modal crowd counting method combining CNN and cross-modal transformer
    Zhang, Shihui
    Wang, Wei
    Zhao, Weibo
    Wang, Lei
    Li, Qunpeng
    [J]. IMAGE AND VISION COMPUTING, 2023, 129
  • [27] VLDeformer: Vision-Language Decomposed Transformer for fast cross-modal retrieval
    Zhang, Lisai
    Wu, Hongfa
    Chen, Qingcai
    Deng, Yimeng
    Siebert, Joanna
    Li, Zhonghua
    Han, Yunpeng
    Kong, Dejiang
    Cao, Zhao
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 252
  • [28] Distillation-Based Hashing Transformer for Cross-Modal Vessel Image Retrieval
    Guo, Jiaen
    Guan, Xin
    Liu, Ying
    Lu, Yu
    [J]. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2023, 20
  • [29] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval
    Gao, Yizhao
    Lu, Zhiwu
    [J]. PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 76 - 84
  • [30] Multimodal Discriminative Binary Embedding for Large-Scale Cross-Modal Retrieval
    Wang, Di
    Gao, Xinbo
    Wang, Xiumei
    He, Lihuo
    Yuan, Bo
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2016, 25 (10) : 4540 - 4554