Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Cited by: 0
Authors
Wei, Jingxuan [1 ,3 ]
Tan, Cheng [2 ]
Gao, Zhangyang [2 ]
Sun, Linzhuang [1 ,3 ]
Li, Siyuan [2 ]
Yu, Bihui [1 ,3 ]
Guo, Ruifeng [1 ,3 ]
Li, Stan Z. [2 ]
Affiliations
[1] Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Liaoning, China
[2] AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China
[3] University of Chinese Academy of Sciences, Liaoning, China
Keywords
Contrastive Learning
DOI
10.1007/s00521-024-10310-2
Abstract
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, primarily focused on multimodal scientific questions and explanations from elementary and high school textbooks, exhibits limitations in providing a comprehensive evaluation across a broader spectrum of open-domain questions. To address this gap, we introduce the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a comprehensive collection of open-ended questions, rationales, and answers derived from the COCO dataset. Unlike previous datasets that rely on multiple-choice questions, our dataset uses open-ended questions to more effectively challenge and assess the reasoning capabilities of CoT models. Through comprehensive evaluations and detailed analyses, we demonstrate that our multihop cross-modal attention and sentence-level contrastive learning modules, designed to simulate human thought processes, significantly enhance model comprehension abilities. Experiments confirm the effectiveness of the proposed dataset and techniques, demonstrating their potential to advance multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
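The abstract names two architectural components, multihop cross-modal attention and sentence-level contrastive learning, without implementation details. The sketch below is a minimal, hypothetical PyTorch rendering of how such modules are commonly built; the class names, dimensions, hop count, temperature, and the InfoNCE-style loss are illustrative assumptions, not the authors' implementation (which is available at the GitHub link above).

```python
# Hypothetical sketch of the two modules named in the abstract. All names
# and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultihopCrossModalAttention(nn.Module):
    """Text tokens repeatedly attend to image patch features ("hops")."""
    def __init__(self, dim: int = 768, num_heads: int = 8, num_hops: int = 3):
        super().__init__()
        self.hops = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_hops)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_hops))

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) token features; image: (B, P, D) patch features
        for attn, norm in zip(self.hops, self.norms):
            fused, _ = attn(query=text, key=image, value=image)
            text = norm(text + fused)  # residual update after each hop
        return text

def sentence_contrastive_loss(anchor: torch.Tensor,
                              positive: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over sentence embeddings: matched (anchor, positive) pairs
    are pulled together; other in-batch sentences are pushed apart."""
    anchor = F.normalize(anchor, dim=-1)      # (B, D)
    positive = F.normalize(positive, dim=-1)  # (B, D)
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# Toy usage: fuse text with image features, then contrast pooled sentences.
B, T, P, D = 4, 16, 49, 768
text, image = torch.randn(B, T, D), torch.randn(B, P, D)
fused = MultihopCrossModalAttention(D)(text, image)
loss = sentence_contrastive_loss(fused.mean(dim=1), torch.randn(B, D))
```

Under these assumptions, each hop lets the text re-query the image conditioned on what earlier hops retrieved, while the contrastive loss aligns embeddings of matched sentence pairs (e.g., a question and its rationale).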
Pages: 20849-20861
Page count: 12
Related papers (50 in total; entries [11]-[20] shown)
  • [11] A Framework for Human-like Behavior in an Immersive Virtual World
    Kuijk, Fons
    Van Broeck, Sigurd
    Dareau, Claude
    Ravenet, Brian
    Ochs, Magalie
    Apostolakis, Konstantinos
    Daras, Petros
    Monaghan, David
    O'Connor, Noel E.
    Wall, Julie
    Izquierdo, Ebroul
    2013 18TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING (DSP), 2013,
  • [12] Effect of complexity in an oscillatory neural network - Relation to human-like reasoning
    Yamanoue, T
    FUZZY SETS AND SYSTEMS, 1996, 82 (02) : 253 - 263
  • [13] Human-like Dexterous Grasping Through Reinforcement Learning and Multimodal Perception
    Qi, Wen
    Fan, Haoyu
    Zheng, Cankun
    Su, Hang
    Alfayad, Samer
    BIOMIMETICS, 2025, 10 (03)
  • [14] A new mouse model with human-like telomeres
    Le Bras, Alexandra
    LAB ANIMAL, 2025, 54 (03) : 63 - 63
  • [15] A Learning Framework for Controlling Robotic Manipulators with Human-like Actions
    Zhao L.
    Yang T.
    Yu P.
    Yang Y.
Jiqiren/Robot, 2023, 45 (05): 513 - 522
  • [17] In Search of Trustworthy and Transparent Intelligent Systems With Human-Like Cognitive and Reasoning Capabilities
    Pal, Nikhil R.
    FRONTIERS IN ROBOTICS AND AI, 2020, 7
  • [18] Learning of human-like algebraic reasoning using deep feedforward neural networks
    Cai, Cheng-Hao
    Xu, Yanyan
    Ke, Dengfeng
    Su, Kaile
    BIOLOGICALLY INSPIRED COGNITIVE ARCHITECTURES, 2018, 25 : 43 - 50
  • [19] A Comprehensive Approach to the Generation of Human-Like Arm Movements on Robot NAO
    Wei, Yuan
    IEEE ACCESS, 2020, 8 : 172869 - 172881
  • [20] A multimodal approach of generating 3D human-like talking agent
    Yang, Minghao
    Tao, Jianhua
    Mu, Kaihui
    Li, Ya
    Che, Jianfeng
    JOURNAL ON MULTIMODAL USER INTERFACES, 2012, 5 (1-2) : 61 - 68