Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning

Cited: 0
Authors
Zhang, Bolin [1 ]
Kyutoku, Haruya [2 ]
Doman, Keisuke [3 ]
Komamizu, Takahiro [4 ]
Ide, Ichiro [5 ]
Qian, Jiangbo [1 ]
Affiliations
[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Zhejiang, Peoples R China
[2] Aichi Univ Technol, Fac Engn, Gamagori, Aichi, Japan
[3] Chukyo Univ, Sch Engn, Toyota, Aichi, Japan
[4] Nagoya Univ, Math & Data Sci Ctr, Nagoya, Aichi, Japan
[5] Nagoya Univ, Grad Sch Informat, Nagoya, Aichi, Japan
Keywords
Cross-modal recipe retrieval; Unified text encoder; Contrastive learning;
DOI
10.1016/j.knosys.2024.112641
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal recipe retrieval is vital for transforming visual food cues into actionable cooking guidance, making culinary creativity more accessible. Existing methods encode the recipe Title, Ingredient, and Instruction separately with different text encoders, aggregate them into a recipe feature, and finally match it with the encoded image feature in a joint embedding space. These methods perform well but incur significant computational cost. In addition, they only match the entire recipe against the image and ignore the fine-grained correspondence between recipe components and the image, resulting in insufficient cross-modal interaction. To this end, we propose the Unified Text Encoder with Fine-grained Contrastive Learning (UTE-FCL) to achieve a simple but efficient model. Specifically, for each recipe, UTE-FCL first concatenates the multi-sentence Ingredient and Instruction texts into a single text each. Then, it joins these two concatenated texts with the original single-phrase Title to obtain the concatenated recipe. Finally, it encodes these three concatenated texts and the original Title with a Transformer-based Unified Text Encoder (UTE). This structure greatly reduces memory usage and improves feature-encoding efficiency. Furthermore, we propose fine-grained contrastive learning objectives that capture the correspondence between recipe components and the image at the Title, Ingredient, and Instruction levels by measuring mutual information. Extensive experiments demonstrate the effectiveness of UTE-FCL compared to existing methods.
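For illustration only, the following is a minimal PyTorch sketch of the idea described in the abstract, not the authors' implementation: one shared Transformer text encoder processes the Title, the concatenated Ingredient text, the concatenated Instruction text, and the full concatenated recipe, and the resulting features are matched against the image feature with component-level contrastive losses. The class and function names (UnifiedTextEncoder, info_nce), the vocabulary size, embedding dimension, 2048-d image-backbone output, temperature, mean pooling, and the use of a standard InfoNCE loss as a tractable stand-in for the mutual-information objective are all assumptions.

# Minimal sketch of the UTE-FCL idea, assuming placeholder tokenization and
# an InfoNCE loss (a common mutual-information lower bound) for the
# fine-grained objectives. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedTextEncoder(nn.Module):
    """One Transformer encoder shared by all recipe components (assumed design)."""
    def __init__(self, vocab_size=30522, dim=512, n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):                       # (B, L) integer token ids
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok_emb(token_ids) + self.pos_emb(pos))
        return F.normalize(h.mean(dim=1), dim=-1)       # mean-pooled, unit-norm feature

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of unit-norm features."""
    logits = a @ b.t() / tau                            # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy forward pass with random data (shapes only).
B, L = 8, 64
ute = UnifiedTextEncoder()
img_proj = nn.Linear(2048, 512)                         # assumed image backbone output size
img_feat = F.normalize(img_proj(torch.randn(B, 2048)), dim=-1)

title_ids = torch.randint(0, 30522, (B, L))             # single-phrase Title
ingr_ids = torch.randint(0, 30522, (B, L))              # concatenated Ingredient sentences
inst_ids = torch.randint(0, 30522, (B, L))              # concatenated Instruction sentences
recipe_ids = torch.randint(0, 30522, (B, L))            # Title + Ingredient + Instruction

# Recipe-level loss plus fine-grained Title/Ingredient/Instruction-level losses.
loss = info_nce(ute(recipe_ids), img_feat)
for ids in (title_ids, ingr_ids, inst_ids):
    loss = loss + info_nce(ute(ids), img_feat)
print(loss.item())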
Pages: 15
Related papers
50 records in total
  • [41] Deep Self-Supervised Hashing With Fine-Grained Similarity Mining for Cross-Modal Retrieval
    Han, Lijun
    Wang, Renlin
    Chen, Chunlei
    Zhang, Huihui
    Zhang, Yujie
    Zhang, Wenfeng
    IEEE ACCESS, 2024, 12 : 31756 - 31770
  • [42] Fine-grained bidirectional attentional generation and knowledge-assisted networks for cross-modal retrieval
    Zhu, Jianwei
    Li, Zhixin
    Wei, Jiahui
    Zeng, Yufei
    Ma, Huifang
    IMAGE AND VISION COMPUTING, 2022, 124
  • [44] Soft Contrastive Cross-Modal Retrieval
    Song, Jiayu
    Hu, Yuxuan
    Zhu, Lei
    Zhang, Chengyuan
    Zhang, Jian
    Zhang, Shichao
    APPLIED SCIENCES-BASEL, 2024, 14 (05):
  • [45] Improving text-image cross-modal retrieval with contrastive loss
    Zhang, Chumeng
    Yang, Yue
    Guo, Junbo
    Jin, Guoqing
    Song, Dan
    Liu, An An
    MULTIMEDIA SYSTEMS, 2023, 29 (02) : 569 - 575
  • [46] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [47] Image-Text Cross-Modal Retrieval with Instance Contrastive Embedding
    Zeng, Ruigeng
    Ma, Wentao
    Wu, Xiaoqian
    Liu, Wei
    Liu, Jie
    ELECTRONICS, 2024, 13 (02)
  • [49] Momentum Cross-Modal Contrastive Learning for Video Moment Retrieval
    Han, De
    Cheng, Xing
    Guo, Nan
    Ye, Xiaochun
    Rainer, Benjamin
    Priller, Peter
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5977 - 5994
  • [50] Cross-Modal Retrieval Based on Semantic Auto-Encoder and Hash Learning
    Lu, Zhu
    Fang, Deng
    Kun, Liu
    Tingting, He
    Yuanyuan, Liu
    Data Analysis and Knowledge Discovery, 2021, 5 (12) : 110 - 122