Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Cited by: 0
Authors
Wei, Jingxuan [1, 3]
Tan, Cheng [2]
Gao, Zhangyang [2]
Sun, Linzhuang [1, 3]
Li, Siyuan [2]
Yu, Bihui [1, 3]
Guo, Ruifeng [1, 3]
Li, Stan Z. [2]
Affiliations
[1] Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Liaoning, China
[2] AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China
[3] University of Chinese Academy of Sciences, Liaoning, China
Keywords
Contrastive Learning
DOI
10.1007/s00521-024-10310-2
Abstract
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, primarily focused on multimodal scientific questions and explanations from elementary and high school textbooks, exhibits limitations in providing a comprehensive evaluation across a broader spectrum of open-domain questions. To address this gap, we introduce the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a comprehensive collection of open-ended questions, rationales, and answers derived from the COCO dataset. Unlike previous datasets that rely on multiple-choice questions, our dataset utilizes open-ended questions to more effectively challenge and assess CoT models’ reasoning capabilities. Through comprehensive evaluations and detailed analyses, we demonstrate that our multihop cross-modal attention and sentence-level contrastive learning modules, designed to simulate human thought processes, significantly enhance model comprehension abilities. Experiments confirm the proposed dataset and techniques, showing their potential to advance multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
Pages: 20849 - 20861
Page count: 12
Related papers
50 entries in total
  • [21] A multimodal approach of generating 3D human-like talking agent
    Minghao Yang
    Jianhua Tao
    Kaihui Mu
    Ya Li
    Jianfeng Che
    Journal on Multimodal User Interfaces, 2012, 5 : 61 - 68
  • [22] A new human-like walking for the humanoid robot Romeo
    Kalouguine, A.
    De-Leon-Gomez, V
    Chevallereau, C.
    Dalibard, S.
    Aoustin, Y.
    MULTIBODY SYSTEM DYNAMICS, 2021, 53 (04) : 411 - 434
  • [23] Human-like cognition: visual features grouping for hard-to-group text dataset
    Li, Xin
    Liu, Hangyuan
    Tao, Chunfeng
    Han, Ruiyi
    Yang, Shumin
    JOURNAL OF ELECTRONIC IMAGING, 2024, 33 (02)
  • [24] A Learning-based Control Framework for Human-like Whip Targeting
    Wang, Junyi
    Xiong, Xiaofeng
    2024 10TH IEEE RAS/EMBS INTERNATIONAL CONFERENCE FOR BIOMEDICAL ROBOTICS AND BIOMECHATRONICS, BIOROB 2024, 2024 : 550 - 555
  • [25] A new human-like walking for the humanoid robot Romeo
    A. Kalouguine
    V. De-León-Gómez
    C. Chevallereau
    S. Dalibard
    Y. Aoustin
    Multibody System Dynamics, 2021, 53 : 411 - 434
  • [26] An Incremental Learning Framework for Human-Like Redundancy Optimization of Anthropomorphic Manipulators
    Su, Hang
    Qi, Wen
    Hu, Yingbai
    Karimi, Hamid Reza
    Ferrigno, Giancarlo
    De Momi, Elena
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2022, 18 (03) : 1864 - 1872
  • [27] A Robot Human-Like Learning Framework Applied to Unknown Environment Interaction
    Xue, Xianfa
    Zuo, Lei
    Wang, Ning
    COMPLEXITY, 2022, 2022
  • [28] A Learning Framework for Human-Like Time Parameterization of Robot Manipulation Paths
    Chen, Lipeng
    Chen, Xiangchi
    Chi, Wanchao
    Zheng, Yu
    2023 IEEE-RAS 22ND INTERNATIONAL CONFERENCE ON HUMANOID ROBOTS, HUMANOIDS, 2023
  • [29] A Proposed Framework for Human-like Language Processing of ChatGPT in Academic Writing
    Mahyoob M.
    Algaraady J.
    Alblwi A.
    International Journal of Emerging Technologies in Learning, 2023, 18 (14) : 282 - 293
  • [30] Combining fuzzy and case-based reasoning to generate human-like music performances
    Arcos, JL
    de Mántaras, RL
    TECHNOLOGIES FOR CONSTRUCTING INTELLIGENT SYSTEMS 1: TASKS, 2002, 89 : 21 - 31