BLINK: Multimodal Large Language Models Can See but Not Perceive

Cited by: 0
Authors
Fu, Xingyu [1 ]
Hu, Yushi [2 ,3 ]
Li, Bangzheng [4 ]
Feng, Yu [1 ]
Wang, Haoyu [1 ]
Lin, Xudong [5 ]
Roth, Dan [1 ]
Smith, Noah A. [2 ,3 ]
Ma, Wei-Chiu [3 ]
Krishna, Ranjay [2 ,3 ]
Affiliations
[1] Univ Penn, Philadelphia, PA 19104 USA
[2] Univ Washington, Seattle, WA 98195 USA
[3] Allen Inst AI, Seattle, WA 98103 USA
[4] Univ Calif Davis, Davis, CA USA
[5] Columbia Univ, New York, NY USA
Keywords
Multi-modal Large Language Models; Vision-Language Benchmark; Visual Perception Evaluation;
DOI
10.1007/978-3-031-73337-6_9
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce BLINK, a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most BLINK tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find that these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. BLINK reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans reach 95.70% accuracy on average, BLINK is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of only 51.26% and 45.72%, just 13.17 and 7.63 percentage points above random guessing, indicating that such perception abilities have not yet "emerged" in recent multimodal LLMs. Our analysis also highlights that specialist CV models solve these problems much better, suggesting potential pathways for future improvements. We believe BLINK will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
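The margins quoted in the abstract are absolute accuracy differences over a random-guess baseline. A minimal sketch (not the authors' code) of that arithmetic is below; the ~38.09% baseline is not stated directly in the abstract but is implied by the quoted figures (BLINK mixes 2-, 3-, and 4-way multiple-choice questions, so the baseline is not simply 25%).

```python
def margin_over_random(model_acc: float, random_acc: float) -> float:
    """Absolute accuracy margin over random guessing, in percentage points."""
    return round(model_acc - random_acc, 2)

# Figures quoted in the abstract; the baseline is inferred from them.
RANDOM_BASELINE = 38.09

results = {"GPT-4V": 51.26, "Gemini": 45.72, "Human": 95.70}
for model, acc in results.items():
    print(f"{model}: {acc:.2f}% "
          f"(+{margin_over_random(acc, RANDOM_BASELINE):.2f} pts over random)")
```

Running this reproduces the 13.17- and 7.63-point margins reported for GPT-4V and Gemini.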
Pages: 148-166
Page count: 19