Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

Cited by: 0

Authors:

Ma, Zheng [1]

Pan, Mianzhi [1]

Wu, Wenhan [1]

Cheng, Kanzhi [1]

Zhang, Jianbing [1]

Huang, Shujian [1]

Chen, Jiajun [1]

Affiliations:

[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

U.S. National Science Foundation;

Keywords:

Vision-language Models; Food Benchmark; Evaluation;

DOI:

10.1145/3581783.3611994

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Vision-language models (VLMs) have shown impressive performance on a wide range of downstream multimodal tasks. However, comparing only fine-tuned performance on downstream tasks yields poor interpretability of VLMs, which hinders their future improvement. Several prior works have identified this issue and used various probing methods under a zero-shot setting to detect VLMs' limitations, but they all examine VLMs using general datasets instead of specialized ones. In practical applications, VLMs are usually applied to specific scenarios, such as e-commerce and news, so the generalization of VLMs to specific domains deserves more attention. In this paper, we comprehensively investigate the capabilities of popular VLMs in a specific field, the food domain. To this end, we build a food caption dataset, Food-500 Cap, which contains 24,700 food images in 494 categories. Each image is accompanied by a detailed caption that includes fine-grained attributes of the food, such as ingredients, shape, and color. We also provide a culinary culture taxonomy that classifies each food category by its geographic origin, in order to better analyze performance differences of VLMs across regions. Experiments on our proposed dataset demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain. Furthermore, our research reveals severe bias in VLMs' ability to handle food items from different geographic regions. We adopt diverse probing methods and evaluate nine VLMs with different architectures to verify these observations. We hope that our study will draw researchers' attention to VLMs' limitations when applied to the domain of food or culinary cultures, and spur further investigation to address this issue.

Pages: 5674-5685

Page count: 12
