Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

Citations: 0
Authors
Ma, Zheng [1]
Pan, Mianzhi [1]
Wu, Wenhan [1]
Cheng, Kanzhi [1]
Zhang, Jianbing [1]
Huang, Shujian [1]
Chen, Jiajun [1]
Affiliations
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
Vision-language Models; Food Benchmark; Evaluation;
DOI
10.1145/3581783.3611994
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Vision-language models (VLMs) have shown impressive performance on a wide range of downstream multimodal tasks. However, comparing only fine-tuned performance on downstream tasks makes VLMs hard to interpret, which hinders their future improvement. Several prior works have identified this issue and used various probing methods under a zero-shot setting to detect VLMs' limitations, but they all examine VLMs using general datasets rather than specialized ones. In practical applications, VLMs are usually applied to specific scenarios, such as e-commerce and news, so the generalization of VLMs to specific domains deserves more attention. In this paper, we comprehensively investigate the capabilities of popular VLMs in one specific field, the food domain. To this end, we build a food caption dataset, Food-500 Cap, which contains 24,700 food images in 494 categories. Each image is accompanied by a detailed caption that includes fine-grained attributes of the food, such as its ingredients, shape, and color. We also provide a culinary culture taxonomy that classifies each food category by its geographic origin, so that differences in VLM performance across regions can be analyzed. Experiments on our proposed dataset demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain. Furthermore, our research reveals severe bias in VLMs' ability to handle food items from different geographic regions. We adopt diverse probing methods and evaluate nine VLMs with different architectures to verify these observations. We hope that our study will draw researchers' attention to VLMs' limitations when they are applied to the domain of food or culinary cultures, and spur further investigation to address this issue.
Pages: 5674-5685
Page count: 12