Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Cited by: 7
Authors
Shukor, Mustafa [1 ]
Couairon, Guillaume [1 ,2 ]
Grechka, Asya [1 ,3 ]
Cord, Matthieu [1 ,4 ]
Affiliations
[1] Sorbonne Univ, Paris, France
[2] Meta AI, New York, NY USA
[3] Meero, Paris, France
[4] Valeoai, Paris, France
Keywords
DOI
10.1109/CVPRW56347.2022.00503
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings with unimodal encoders, which allow efficient retrieval in large-scale databases but leave aside cross-attention between modalities, which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval), that exploits the interaction between modalities in a novel regularization scheme, while using only unimodal encoders at test time for efficient retrieval. We also capture the intra-dependencies between recipe entities with a dedicated recipe encoder, and propose new variants of triplet losses with dynamic margins that adapt to the difficulty of the task. Finally, we leverage the power of recent Vision and Language Pretraining (VLP) models such as CLIP for the image encoder. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6 R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets, respectively. The code is available at https://github.com/mshukor/TFood.
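
The abstract mentions triplet losses with dynamic margins that adapt to the difficulty of the task. As a rough illustration only (not the authors' exact formulation; the function name, the sigmoid-based difficulty measure, and the margin bounds below are assumptions), a minimal PyTorch sketch of a triplet loss whose margin grows with triplet difficulty could look like this:

import torch
import torch.nn.functional as F

def dynamic_margin_triplet_loss(anchor, positive, negative,
                                base_margin=0.3, max_margin=0.6):
    # anchor, positive, negative: L2-normalized embeddings of shape (B, D),
    # e.g. image embeddings paired with matching / non-matching recipe embeddings.
    pos_sim = F.cosine_similarity(anchor, positive)  # shape (B,)
    neg_sim = F.cosine_similarity(anchor, negative)  # shape (B,)

    # Hypothetical difficulty measure: the closer the negative similarity is
    # to the positive one, the harder the triplet and the larger the margin.
    difficulty = torch.sigmoid(neg_sim - pos_sim)                    # in (0, 1)
    margin = base_margin + (max_margin - base_margin) * difficulty  # per sample

    # Standard hinge-style triplet objective with a per-sample dynamic margin.
    return F.relu(neg_sim - pos_sim + margin).mean()

# Usage with random embeddings standing in for image / recipe features.
img = F.normalize(torch.randn(8, 512), dim=-1)
rec_pos = F.normalize(torch.randn(8, 512), dim=-1)
rec_neg = F.normalize(torch.randn(8, 512), dim=-1)
loss = dynamic_margin_triplet_loss(img, rec_pos, rec_neg)

At test time, as the abstract notes, only the unimodal encoders that produce such embeddings are needed, so retrieval reduces to a nearest-neighbour search over precomputed vectors.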
Pages: 4566-4577
Page count: 12
Related Papers
50 items in total
  • [1] Multimodal Encoders for Food-Oriented Cross-Modal Retrieval
    Chen, Ying
    Zhou, Dong
    Li, Lin
    Han, Jun-mei
    [J]. WEB AND BIG DATA, APWEB-WAIM 2021, PT II, 2021, 12859 : 253 - 266
  • [2] Multimodal adversarial network for cross-modal retrieval
    Hu, Peng
    Peng, Dezhong
    Wang, Xu
    Xiang, Yong
    [J]. KNOWLEDGE-BASED SYSTEMS, 2019, 180 : 38 - 50
  • [3] Multimodal Graph Learning for Cross-Modal Retrieval
    Xie, Jingyou
    Zhao, Zishuo
    Lin, Zhenzhou
    Shen, Ying
    [J]. PROCEEDINGS OF THE 2023 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM, 2023, : 145 - 153
  • [4] Cross-lingual Cross-modal Pretraining for Multimodal Retrieval
    Fei, Hongliang
    Yu, Tan
    Li, Ping
    [J]. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 3644 - 3650
  • [5] Deep Multimodal Transfer Learning for Cross-Modal Retrieval
    Zhen, Liangli
    Hu, Peng
    Peng, Xi
    Goh, Rick Siow Mong
    Zhou, Joey Tianyi
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (02) : 798 - 810
  • [6] Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis
    Hu, Xuming
    Guo, Zhijiang
    Teng, Zhiyang
    King, Irwin
    Yu, Philip S.
[J]. 61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 303 - 311
  • [7] Scalable Deep Multimodal Learning for Cross-Modal Retrieval
    Hu, Peng
    Zhen, Liangli
    Peng, Dezhong
    Liu, Pei
    [J]. PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 635 - 644
  • [8] Multimodal Multiclass Boosting and its Application to Cross-modal Retrieval
    Wang, Shixun
    Dou, Zhi
    Chen, Deng
    Yu, Hairong
    Li, Yuan
    Pan, Peng
    [J]. NEUROCOMPUTING, 2019, 357 : 11 - 23
  • [9] Cross-Modal Retrieval using Random Multimodal Deep Learning
    Somasekar, Hemanth
    Naveen, Kavya
[J]. JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL SCIENCES, 2019, 14 (02) : 185 - 200
  • [10] Deep supervised multimodal semantic autoencoder for cross-modal retrieval
    Tian, Yu
    Yang, Wenjing
    Liu, Qingsong
    Yang, Qiong
    [J]. COMPUTER ANIMATION AND VIRTUAL WORLDS, 2020, 31 (4-5)