Exploring the Capabilities of Large Multimodal Models on Dense Text

Cited by: 0
Authors
Zhang, Shuo [1 ]
Yang, Biao [1 ]
Li, Zhang [1 ]
Ma, Zhiyin [1 ]
Liu, Yuliang [1 ]
Bai, Xiang [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Large multi-modal model; Dense text visual question answering; Evaluation;
DOI
10.1007/978-3-031-70552-6_17
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables more accurate information extraction and supports better decision-making. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that significant improvements in model performance can be achieved even with automatically labeled training datasets. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.
Pages: 281-298
Page count: 18
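
Illustrative note: the abstract describes evaluating LMMs on dense-text question-answer pairs. The following is a minimal sketch of what such an evaluation loop could look like; the annotation layout, the predict stub, and the substring-match scoring are assumptions for illustration only, not the paper's actual protocol (see the linked repository for the released code and data).

# Minimal sketch of a dense-text VQA evaluation loop (illustrative only).
# Assumptions not taken from the paper: the JSON layout of the annotation
# file and the predict() stub; consult the DT-VQA release at
# https://github.com/Yuliang-Liu/MultimodalOCR for the actual format and metric.
import json


def predict(image_path: str, question: str) -> str:
    """Placeholder for a call to the LMM under evaluation (e.g., GPT4V, Gemini, or an open-source model)."""
    raise NotImplementedError("Plug in the model under evaluation here.")


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumerics/spaces so string matching is lenient."""
    return "".join(ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())


def evaluate(annotation_file: str) -> float:
    """Simple accuracy: a prediction counts as correct if the normalized
    ground-truth answer appears as a substring of the normalized prediction."""
    with open(annotation_file, encoding="utf-8") as f:
        samples = json.load(f)  # assumed: list of {"image", "question", "answer"} records

    correct = 0
    for sample in samples:
        pred = predict(sample["image"], sample["question"])
        if normalize(sample["answer"]) in normalize(pred):
            correct += 1
    return correct / max(len(samples), 1)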