Exploring the Capabilities of Large Multimodal Models on Dense Text

Cited by: 0
Authors
Zhang, Shuo [1]
Yang, Biao [1]
Li, Zhang [1]
Ma, Zhiyin [1]
Liu, Yuliang [1]
Bai, Xiang [1]
Affiliations
[1] Huazhong University of Science and Technology, Wuhan, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Large multi-modal model; Dense text visual question answering; Evaluation
DOI
10.1007/978-3-031-70552-6_17
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information and thus make better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.
Pages: 281-298
Page count: 18