Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Cited: 7
Authors
Shukor, Mustafa [1]
Couairon, Guillaume [1,2]
Grechka, Asya [1,3]
Cord, Matthieu [1,4]
Affiliations
[1] Sorbonne Univ, Paris, France
[2] Meta AI, New York, NY, USA
[3] Meero, Paris, France
[4] Valeo.ai, Paris, France
Keywords
DOI
10.1109/CVPRW56347.2022.00503
CLC Classification Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, which allow for efficient retrieval in large-scale databases, leaving aside cross-attention between modalities, which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval), that exploits the interaction between modalities in a novel regularization scheme, while using only unimodal encoders at test time for efficient retrieval. We also capture the intra-dependencies between recipe entities with a dedicated recipe encoder, and propose new variants of triplet losses with dynamic margins that adapt to the difficulty of the task. Finally, we leverage the power of recent Vision and Language Pretraining (VLP) models such as CLIP for the image encoder. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6 R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets, respectively. The code is available at https://github.com/mshukor/TFood.
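Note: the abstract mentions triplet losses with dynamic margins that adapt to the difficulty of the task. As a rough illustration only, and not the paper's exact formulation, the sketch below shows a bidirectional triplet loss whose margin grows with the hardest in-batch negative; the function name, margin schedule, and hyper-parameters are assumptions made for this example.

    import torch
    import torch.nn.functional as F

    def dynamic_margin_triplet_loss(img_emb, rec_emb, base_margin=0.3, scale=0.3):
        # Illustrative sketch (not the paper's exact loss): a bidirectional
        # in-batch triplet loss whose margin increases when the hardest
        # negative is close to the positive pair (i.e., the batch is hard).
        img_emb = F.normalize(img_emb, dim=-1)
        rec_emb = F.normalize(rec_emb, dim=-1)
        sim = img_emb @ rec_emb.t()          # (B, B) cosine similarities
        pos = sim.diag()                     # matching image-recipe pairs

        # Hardest negatives in both retrieval directions (diagonal masked out).
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        neg_i2r = sim.masked_fill(mask, -2.0).max(dim=1).values  # image -> recipe
        neg_r2i = sim.masked_fill(mask, -2.0).max(dim=0).values  # recipe -> image

        # Dynamic margin: larger when negatives are highly similar (hard cases).
        margin_i2r = base_margin + scale * neg_i2r.detach().clamp(min=0.0)
        margin_r2i = base_margin + scale * neg_r2i.detach().clamp(min=0.0)

        loss_i2r = F.relu(margin_i2r + neg_i2r - pos).mean()
        loss_r2i = F.relu(margin_r2i + neg_r2i - pos).mean()
        return loss_i2r + loss_r2i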
Pages: 4566-4577
Number of pages: 12
Related Papers
50 in total
  • [41] Wang, Zheng; Gao, Zhenwei; Yang, Yang; Wang, Guoqing; Jiao, Chengbo; Shen, Heng Tao. Geometric Matching for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems, 2024.
  • [42] Mandal, Devraj; Biswas, Soma. Cross-Modal Retrieval with Noisy Labels. 2020 IEEE International Conference on Image Processing (ICIP), 2020: 2326-2330.
  • [43] Zhong, Fangming; Wang, Guangze; Chen, Zhikui; Xia, Feng; Min, Geyong. Cross-Modal Retrieval for CPSS Data. IEEE Access, 2020, 8: 16689-16701.
  • [44] Wang, Shixun; Pan, Peng; Lu, Yansheng. A Graph Model for Cross-modal Retrieval. Proceedings of the 3rd International Conference on Multimedia Technology (ICMT-13), 2013, 84: 1090-1097.
  • [45] Liu, Yao; Yuan, Yanhong; Huang, Qiaoli; Huang, Zhixing. Hashing for Cross-Modal Similarity Retrieval. 2015 11th International Conference on Semantics, Knowledge and Grids (SKG), 2015: 1-8.
  • [46] Zhen, Liangli; Hu, Peng; Wang, Xu; Peng, Dezhong. Deep Supervised Cross-modal Retrieval. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 10386-10395.
  • [47] Xu, Qingzhen; Liu, Shuang; Qiao, Han; Li, Miao. Cross-modal retrieval with dual optimization. Multimedia Tools and Applications, 2023, 82: 7141-7157.
  • [48] Yu, Zheng; Wang, Wenmin. Learning DALTS for cross-modal retrieval. CAAI Transactions on Intelligence Technology, 2019, 4(1): 9-16.
  • [49] Wang, Zheng; Xu, Xing; Wei, Jiwei; Xie, Ning; Yang, Yang; Shen, Heng Tao. Semantics Disentangling for Cross-Modal Retrieval. IEEE Transactions on Image Processing, 2024, 33: 2226-2237.
  • [50] Wang, Kai; Herranz, Luis; van de Weijer, Joost. Continual learning in cross-modal retrieval. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2021), 2021: 3623-3633.