Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Cited by: 0
Authors
Wei, Jingxuan [1 ,3 ]
Tan, Cheng [2 ]
Gao, Zhangyang [2 ]
Sun, Linzhuang [1 ,3 ]
Li, Siyuan [2 ]
Yu, Bihui [1 ,3 ]
Guo, Ruifeng [1 ,3 ]
Li, Stan Z. [2 ]
Affiliations
[1] Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Liaoning, China
[2] AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China
[3] University of Chinese Academy of Sciences, Liaoning, China
Keywords
Contrastive Learning
DOI
10.1007/s00521-024-10310-2
Abstract
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, primarily focused on multimodal scientific questions and explanations from elementary and high school textbooks, exhibits limitations in providing a comprehensive evaluation across a broader spectrum of open-domain questions. To address this gap, we introduce the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a comprehensive collection of open-ended questions, rationales, and answers derived from the COCO dataset. Unlike previous datasets that rely on multiple-choice questions, our dataset uses open-ended questions to more effectively challenge and assess the reasoning capabilities of CoT models. Through comprehensive evaluations and detailed analyses, we demonstrate that our multihop cross-modal attention and sentence-level contrastive learning modules, designed to simulate human thought processes, significantly enhance model comprehension abilities. Experiments confirm the effectiveness of the proposed dataset and techniques, demonstrating their potential to advance multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
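The abstract names two architectural components, multihop cross-modal attention and sentence-level contrastive learning, without implementation details. The sketch below is a minimal, hypothetical PyTorch rendering of how such modules are commonly built; the class names, dimensions, hop count, temperature, and the InfoNCE-style loss are illustrative assumptions, not the authors' implementation (which is available at the GitHub link above).

```python
# Hypothetical sketch of the two modules named in the abstract. All names
# and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultihopCrossModalAttention(nn.Module):
    """Text tokens repeatedly attend to image patch features ("hops")."""
    def __init__(self, dim: int = 768, num_heads: int = 8, num_hops: int = 3):
        super().__init__()
        self.hops = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_hops)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_hops))

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) token features; image: (B, P, D) patch features
        for attn, norm in zip(self.hops, self.norms):
            fused, _ = attn(query=text, key=image, value=image)
            text = norm(text + fused)  # residual update after each hop
        return text

def sentence_contrastive_loss(anchor: torch.Tensor,
                              positive: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over sentence embeddings: matched (anchor, positive) pairs
    are pulled together; other in-batch sentences are pushed apart."""
    anchor = F.normalize(anchor, dim=-1)      # (B, D)
    positive = F.normalize(positive, dim=-1)  # (B, D)
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# Toy usage: fuse text with image features, then contrast pooled sentences.
B, T, P, D = 4, 16, 49, 768
text, image = torch.randn(B, T, D), torch.randn(B, P, D)
fused = MultihopCrossModalAttention(D)(text, image)
loss = sentence_contrastive_loss(fused.mean(dim=1), torch.randn(B, D))
```

Under these assumptions, each hop lets the text re-query the image conditioned on what earlier hops retrieved, while the contrastive loss aligns embeddings of matched sentence pairs (e.g., a question and its rationale).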
Pages: 20849-20861
Page count: 12
Related papers (50 in total; entries [11]-[20] shown)
  • [11] A Framework for Human-like Behavior in an Immersive Virtual World
    Kuijk, Fons
    Van Broeck, Sigurd
    Dareau, Claude
    Ravenet, Brian
    Ochs, Magalie
    Apostolakis, Konstantinos
    Daras, Petros
    Monaghan, David
    O'Connor, Noel E.
    Wall, Julie
    Izquierdo, Ebroul
    2013 18TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING (DSP), 2013,
  • [12] Effect of complexity in an oscillatory neural network - Relation to human-like reasoning
    Yamanoue, T
    FUZZY SETS AND SYSTEMS, 1996, 82 (02) : 253 - 263
  • [13] Human-like Dexterous Grasping Through Reinforcement Learning and Multimodal Perception
    Qi, Wen
    Fan, Haoyu
    Zheng, Cankun
    Su, Hang
    Alfayad, Samer
    BIOMIMETICS, 2025, 10 (03)
  • [14] A new mouse model with human-like telomeres
    Le Bras, Alexandra
    LAB ANIMAL, 2025, 54 (03) : 63 - 63
  • [15] A Learning Framework for Controlling Robotic Manipulators with Human-like Actions
    Zhao L.
    Yang T.
    Yu P.
    Yang Y.
Jiqiren/Robot, 2023, 45 (05): 513 - 522
  • [17] In Search of Trustworthy and Transparent Intelligent Systems With Human-Like Cognitive and Reasoning Capabilities
    Pal, Nikhil R.
    FRONTIERS IN ROBOTICS AND AI, 2020, 7
  • [18] Learning of human-like algebraic reasoning using deep feedforward neural networks
    Cai, Cheng-Hao
    Xu, Yanyan
    Ke, Dengfeng
    Su, Kaile
    BIOLOGICALLY INSPIRED COGNITIVE ARCHITECTURES, 2018, 25 : 43 - 50
  • [19] A Comprehensive Approach to the Generation of Human-Like Arm Movements on Robot NAO
    Wei, Yuan
    IEEE ACCESS, 2020, 8 : 172869 - 172881
  • [20] A multimodal approach of generating 3D human-like talking agent
    Yang, Minghao
    Tao, Jianhua
    Mu, Kaihui
    Li, Ya
    Che, Jianfeng
    JOURNAL ON MULTIMODAL USER INTERFACES, 2012, 5 (1-2) : 61 - 68