Exploring the Capabilities of Large Multimodal Models on Dense Text

Cited by: 0
Authors
Zhang, Shuo [1 ]
Yang, Biao [1 ]
Li, Zhang [1 ]
Ma, Zhiyin [1 ]
Liu, Yuliang [1 ]
Bai, Xiang [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Large multi-modal model; Dense text visual question answering; Evaluation;
DOI
10.1007/978-3-031-70552-6_17
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT-4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that significant improvements in model performance can be achieved even with automatically labeled training datasets. We hope that this research will promote the study of LMMs on dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.
Pages: 281 - 298
Page count: 18
Related Papers
50 items total
  • [41] Visual cognition in multimodal large language models
    Buschoff, Luca M. Schulze
    Akata, Elif
    Bethge, Matthias
    Schulz, Eric
    NATURE MACHINE INTELLIGENCE, 2025, 7 (01) : 96 - 106
  • [42] Multimodal large language models for bioimage analysis
    Zhang, Shanghang
    Dai, Gaole
    Huang, Tiejun
    Chen, Jianxu
    NATURE METHODS, 2024, 21 (08) : 1390 - 1393
  • [43] Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study
    Chen, Yi
    Wang, Rui
    Jiang, Haiyun
    Shi, Shuming
    Xu, Ruifeng
    13TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING AND THE 3RD CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, IJCNLP-AACL 2023, 2023, : 361 - 374
  • [44] INTRANETS IN LARGE CONSTRUCTION ORGANISATIONS: EXPLORING ADVANCEMENTS, CAPABILITIES AND BARRIERS
    Ingirige, Bingunath
    Sexton, Martin
    JOURNAL OF INFORMATION TECHNOLOGY IN CONSTRUCTION, 2007, 12 : 409 - 427
  • [45] Computing Architecture for Large-Language Models (LLMs) and Large Multimodal Models (LMMs)
    Liang, Bor-Sung
    PROCEEDINGS OF THE 2024 INTERNATIONAL SYMPOSIUM ON PHYSICAL DESIGN, ISPD 2024, 2024, : 233 - 234
  • [46] Negative Capabilities: Investigating Apophasis in AI Text-to-Image Models
    Lucas, Hannah
    RELIGIONS, 2023, 14 (06)
  • [47] Exploring Large-Scale Language Models to Evaluate EEG-Based Multimodal Data for Mental Health
    Hu, Yongquan
    Zhang, Shuning
    Dang, Ting
    Jia, Hong
    Salim, Flora D.
    Hu, Wen
    Quigley, Aaron J.
    COMPANION OF THE 2024 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING, UBICOMP COMPANION 2024, 2024, : 412 - 417
  • [48] Large language models: a survey of their development, capabilities, and applications
    Annepaka, Yadagiri
    Pakray, Partha
    KNOWLEDGE AND INFORMATION SYSTEMS, 2025, 67 (03) : 2967 - 3022
  • [49] Question Generation Capabilities of "Small" Large Language Models
    Berger, Joshua
    Koss, Jonathan
    Stamatakis, Markos
    Hoppe, Anett
    Ewerth, Ralph
    Wartenal, Christian
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, PT II, NLDB 2024, 2024, 14763 : 183 - 194
  • [50] Exploration of the capabilities of large language models in preoperative assessment
    Burdon, Robert
    Braunbeck, Kai
    Kotze, Alwyn
    BRITISH JOURNAL OF ANAESTHESIA, 2024, 133 (02) : 460 - 460