Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Cited by: 0
Authors
Wei, Jingxuan [1 ,3 ]
Tan, Cheng [2 ]
Gao, Zhangyang [2 ]
Sun, Linzhuang [1 ,3 ]
Li, Siyuan [2 ]
Yu, Bihui [1 ,3 ]
Guo, Ruifeng [1 ,3 ]
Li, Stan Z. [2 ]
Affiliations
[1] Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Liaoning, China
[2] AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China
[3] University of Chinese Academy of Sciences, Liaoning, China
Keywords
Contrastive Learning;
DOI
10.1007/s00521-024-10310-2
Abstract
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, primarily focused on multimodal scientific questions and explanations from elementary and high school textbooks, exhibits limitations in providing a comprehensive evaluation across a broader spectrum of open-domain questions. To address this gap, we introduce the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a comprehensive collection of open-ended questions, rationales, and answers derived from the COCO dataset. Unlike previous datasets that rely on multiple-choice questions, our dataset uses open-ended questions to more effectively challenge and assess the reasoning capabilities of CoT models. Through comprehensive evaluations and detailed analyses, we demonstrate that our multihop cross-modal attention and sentence-level contrastive learning modules, designed to simulate human thought processes, significantly enhance model comprehension abilities. Experiments confirm the effectiveness of the proposed dataset and techniques, showing their potential to advance multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
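The abstract names a sentence-level contrastive learning module but does not detail it; for general orientation only, the standard idea behind such modules can be sketched as an InfoNCE-style loss that pulls matched text–image embedding pairs together and pushes mismatched pairs apart. The function name, the NumPy formulation, and the temperature value below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """InfoNCE-style contrastive loss: matched (text, image) pairs are
    positives; every other pairing in the batch serves as a negative."""
    # L2-normalize so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy against the diagonal: pair i should match pair i
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Four orthogonal toy "sentence" embeddings
t = np.eye(4, 8)
aligned = info_nce_loss(t, t)                        # correctly matched pairs
shuffled = info_nce_loss(t, np.roll(t, 1, axis=0))   # deliberately mismatched
```

With perfectly matched embeddings the loss is near zero, while the shuffled pairing drives it up sharply, which is the training signal a contrastive module of this kind exploits.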
Pages: 20849-20861
Number of pages: 12
Related papers
50 records in total
  • [1] Human-Like Spatial Reasoning Formalisms
    Walega, Przemyslaw Andrze
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 5054 - 5055
  • [2] Human-like Visual Learning and Reasoning
    Cui, Peng
    Zhu, Wenwu
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1951 - 1952
  • [3] Projection: a mechanism for human-like reasoning in Artificial Intelligence
    Guerin, F.
    JOURNAL OF EXPERIMENTAL & THEORETICAL ARTIFICIAL INTELLIGENCE, 2023, 35 (08) : 1269 - 1293
  • [4] GNOSTRON: a framework for human-like machine understanding
    Yufik, Yan M.
    2018 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI), 2018, : 136 - 145
  • [5] Research on Expansion Method of Detection Dataset for "Human-like" Socialbots
    Liu X.
    Xu Y.
    Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China, 2022, 51 (01): : 130 - 137
  • [6] FUZZY-SYSTEMS FOR SIMULATING HUMAN-LIKE REASONING AND CONTROL
    QUARANTA, TF
    JOHNS HOPKINS APL TECHNICAL DIGEST, 1995, 16 (01): : 43 - 58
  • [7] CLeAR: Continual Learning on Algorithmic Reasoning for Human-like Intelligence
    Kang, Bong Gyun
    Kim, HyunGi
    Jung, Dahuin
    Yoon, Sungroh
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] Human-Like Multimodal Perception and Purposeful Manipulation for Deformable Objects
    Kaur, Upinder
    Ma, Xin
    Huang, Yuanmeng
    Voyles, Richard M.
    2022 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATION SCIENCE AND ENGINEERING (CASE), 2022, : 1790 - 1797
  • [9] From annotated multimodal corpora to simulated human-like behaviors
    Rehm, Matthias
    Andre, Elisabeth
    MODELLING COMMUNICATION WITH ROBOTS AND VIRTUAL HUMANS, 2008, 4930 : 1 - 17
  • [10] An Efficient Framework for Multiple Tasks in Human-like Robots
    Jeong, Jae Won
    Chang, Pyung Hun
    HUMANOIDS: 2007 7TH IEEE-RAS INTERNATIONAL CONFERENCE ON HUMANOID ROBOTS, 2007, : 513 - 519