Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning

Cited by: 0
Authors
Zhang, Bolin [1 ]
Kyutoku, Haruya [2 ]
Doman, Keisuke [3 ]
Komamizu, Takahiro [4 ]
Ide, Ichiro [5 ]
Qian, Jiangbo [1 ]
Affiliations
[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Zhejiang, Peoples R China
[2] Aichi Univ Technol, Fac Engn, Gamagori, Aichi, Japan
[3] Chukyo Univ, Sch Engn, Toyota, Aichi, Japan
[4] Nagoya Univ, Math & Data Sci Ctr, Nagoya, Aichi, Japan
[5] Nagoya Univ, Grad Sch Informat, Nagoya, Aichi, Japan
Keywords
Cross-modal recipe retrieval; Unified text encoder; Contrastive learning;
DOI
10.1016/j.knosys.2024.112641
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Cross-modal recipe retrieval is vital for transforming visual food cues into actionable cooking guidance, making culinary creativity more accessible. Existing methods separately encode the recipe Title, Ingredient, and Instruction with different text encoders, aggregate them to obtain the recipe feature, and finally match it with the encoded image feature in a joint embedding space. These methods perform well but require significant computational cost. In addition, they only match the entire recipe with the image and ignore the fine-grained correspondence between recipe components and the image, resulting in insufficient cross-modal interaction. To this end, we propose the Unified Text Encoder with Fine-grained Contrastive Learning (UTE-FCL) to achieve a simple but efficient model. Specifically, within each recipe, UTE-FCL first concatenates the multi-sentence Ingredient and Instruction texts into a single text each. It then joins these two concatenated texts with the original single-phrase Title to obtain the concatenated recipe. Finally, it encodes these three concatenated texts and the original Title with a Transformer-based Unified Text Encoder (UTE). This structure greatly reduces memory usage and improves feature-encoding efficiency. Furthermore, we propose fine-grained contrastive learning objectives that capture the correspondence between recipe components and the image at the Title, Ingredient, and Instruction levels by measuring mutual information. Extensive experiments demonstrate the effectiveness of UTE-FCL compared to existing methods.
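The unified-encoder and fine-grained contrastive ideas summarized in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the names (UnifiedTextEncoder, info_nce), the hidden size, layer counts, mean pooling, and temperature are all assumptions made for the example. It only shows how a single shared Transformer text encoder could encode the Title, the concatenated Ingredient and Instruction texts, and the concatenated recipe, and how InfoNCE-style losses (a standard lower bound on mutual information) could be applied at each level against the image feature.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UnifiedTextEncoder(nn.Module):
        """One Transformer encoder shared by Title, Ingredient, Instruction,
        and the concatenated recipe (instead of a separate encoder per component)."""
        def __init__(self, vocab_size=30522, dim=512, n_layers=4, n_heads=8, max_len=512):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, dim)
            self.pos = nn.Embedding(max_len, dim)
            layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, token_ids):                        # token_ids: (B, L)
            positions = torch.arange(token_ids.size(1), device=token_ids.device)
            h = self.encoder(self.tok(token_ids) + self.pos(positions))
            return h.mean(dim=1)                              # (B, dim) pooled text feature

    def info_nce(text_feat, image_feat, tau=0.07):
        """Symmetric InfoNCE loss, a standard lower bound on the mutual
        information between a text component and its paired image."""
        t = F.normalize(text_feat, dim=-1)
        v = F.normalize(image_feat, dim=-1)
        logits = t @ v.t() / tau                              # (B, B) cosine similarities
        labels = torch.arange(t.size(0), device=t.device)     # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    # Usage sketch with random stand-in data: one shared encoder produces features at
    # every text granularity, and component-level losses are added to the recipe-level one.
    ute = UnifiedTextEncoder()
    title_ids, ingr_ids, instr_ids, recipe_ids = (torch.randint(0, 30522, (8, 64)) for _ in range(4))
    image_feat = torch.randn(8, 512)                          # stand-in for an image-encoder output
    loss = (info_nce(ute(recipe_ids), image_feat)             # whole recipe <-> image
            + info_nce(ute(title_ids), image_feat)            # Title level
            + info_nce(ute(ingr_ids), image_feat)             # Ingredient level
            + info_nce(ute(instr_ids), image_feat))           # Instruction level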
Pages: 15
Related Papers
50 items in total
  • [21] A Jointly Guided Deep Network for Fine-Grained Cross-Modal Remote Sensing Text-Image Retrieval
    Yang, Lei
    Feng, Yong
    Zhou, Mingling
    Xiong, Xiancai
    Wang, Yongheng
    Qiang, Baohua
    JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2023, 32 (13)
  • [22] Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
    Lu, Haoyu
    Huo, Yuqi
    Ding, Mingyu
    Fei, Nanyi
    Lu, Zhiwu
    MACHINE INTELLIGENCE RESEARCH, 2023, 20 (04) : 569 - 582
  • [24] Histopathology language-image representation learning for fine-grained digital pathology cross-modal retrieval
    Hu, Dingyi
    Jiang, Zhiguo
    Shi, Jun
    Xie, Fengying
    Wu, Kun
    Tang, Kunming
    Cao, Ming
    Huai, Jianguo
    Zheng, Yushan
    MEDICAL IMAGE ANALYSIS, 2024, 35
  • [25] Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval
    Lu, Yuhang
    Yu, Jing
    Liu, Yanbing
    Tan, Jianlong
    Guo, Li
    Zhang, Weifeng
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT (KSEM 2018), PT I, 2018, 11061 : 213 - 225
  • [27] Deep cross-modal hashing with fine-grained similarity
    Chen, Yangdong
    Quan, Jiaqi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    APPLIED INTELLIGENCE, 2023, 53 (23) : 28954 - 28973
  • [28] PBLF: Prompt Based Learning Framework for Cross-Modal Recipe Retrieval
    Sun, Jialiang
    Li, Jiao
    ARTIFICIAL INTELLIGENCE AND ROBOTICS, ISAIR 2022, PT I, 2022, 1700 : 388 - 402
  • [29] Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval
    Yuan, Zhiqiang
    Zhang, Wenkai
    Fu, Kun
    Li, Xuan
    Deng, Chubo
    Wang, Hongqi
    Sun, Xian
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [30] Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders
    Messina, Nicola
    Amato, Giuseppe
    Esuli, Andrea
    Falchi, Fabrizio
    Gennaro, Claudio
    Marchand-Maillet, Stephane
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 17 (04)