Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Cited by: 0
Authors
Wei, Jingxuan [1 ,3 ]
Tan, Cheng [2 ]
Gao, Zhangyang [2 ]
Sun, Linzhuang [1 ,3 ]
Li, Siyuan [2 ]
Yu, Bihui [1 ,3 ]
Guo, Ruifeng [1 ,3 ]
Li, Stan Z. [2 ]
Affiliations
[1] Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Liaoning, China
[2] AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China
[3] University of Chinese Academy of Sciences, Liaoning, China
Keywords
Contrastive Learning
DOI
10.1007/s00521-024-10310-2
Abstract
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, primarily focused on multimodal scientific questions and explanations from elementary and high school textbooks, exhibits limitations in providing a comprehensive evaluation across a broader spectrum of open-domain questions. To address this gap, we introduce the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a comprehensive collection of open-ended questions, rationales, and answers derived from the COCO dataset. Unlike previous datasets that rely on multiple-choice questions, our dataset uses open-ended questions to more effectively challenge and assess the reasoning capabilities of CoT models. Through comprehensive evaluations and detailed analyses, we demonstrate that our multihop cross-modal attention and sentence-level contrastive learning modules, designed to simulate human thought processes, significantly enhance model comprehension abilities. Experiments confirm the effectiveness of the proposed dataset and techniques, demonstrating their potential to advance multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
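The abstract mentions a sentence-level contrastive learning module as one of the two proposed components. As a rough illustration only, the sketch below shows a generic sentence-level contrastive (InfoNCE) loss in PyTorch; the function name, the temperature value, and the cosine-similarity formulation are assumptions made here for illustration and are not taken from the paper. The authors' actual implementation is available in the linked repository.

```python
# Illustrative sketch only: a generic sentence-level contrastive (InfoNCE) loss.
# Not the paper's implementation; names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def sentence_contrastive_loss(anchor_emb: torch.Tensor,
                              positive_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss over a batch of sentence embeddings.

    anchor_emb, positive_emb: (batch, dim) tensors. Row i of the two tensors
    forms a positive pair; all other rows in the batch act as negatives.
    """
    # L2-normalise so the dot product equals cosine similarity.
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = anchor_emb @ positive_emb.t() / temperature

    # The matching (diagonal) entry is the positive class for each row.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    a = torch.randn(8, 256)
    b = torch.randn(8, 256)
    print(sentence_contrastive_loss(a, b).item())
```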
Pages: 20849-20861
Number of pages: 12
Related papers
50 items in total
  • [31] Towards Human-Like Bots Using Online Interactive Case-Based Reasoning
    Miranda, Maximiliano
    Sanchez-Ruiz, Antonio A.
    Peinado, Federico
    CASE-BASED REASONING RESEARCH AND DEVELOPMENT, ICCBR 2019, 2019, 11680 : 314 - 328
  • [32] A Novel Human-Like Control Framework for Mobile Medical Service Robot
    Zhang, Xin
    Li, Jiehao
    Qi, Wen
    Zhou, Xuanyi
    Hu, Yingbai
    Quan, Hao
    Wang, Zhen
    COMPLEXITY, 2020, 2020
  • [33] Evaluating Large Language Models with NeuBAROCO: Syllogistic Reasoning Ability and Human-like Biases
    Ando, Risako
    Morishita, Takanobu
    Abe, Hirohiko
    Mineshima, Koji
    Okada, Mitsuhiro
    arXiv, 2023,
  • [34] Semantic Segmentation Optimization in Power Systems: Enhancing Human-Like Switching Operations
    Hua, Jin
    Zhao, Yue
    Zhang, Huijun
    Zhao, Haiming
    Wang, Lei
    TRAITEMENT DU SIGNAL, 2023, 40 (04) : 1401 - 1412
  • [35] Development of a new human-like talking robot for human vocal mimicry
    Fukui, K
    Nishikawa, K
    Kuwae, T
    Takanobu, H
    Mochida, T
    Honda, M
    Takanishi, A
    2005 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), VOLS 1-4, 2005, : 1437 - 1442
  • [36] Leveraging Multimodal Sensing and Topometric Mapping for Human-Like Autonomous Navigation in Complex Environments
    Tsiakas, Kosmas
    Alexiou, Dimitrios
    Giakoumis, Dimitrios
    Gasteratos, Antonios
    Tzovaras, Dimitrios
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2023, : 7415 - 7421
  • [37] TOWARD HUMAN-LIKE COMMUNICATIONS - A LIFE - A NEW RESEARCH PARADIGM
    HABARA, F
    OPTOELECTRONICS-DEVICES AND TECHNOLOGIES, 1995, 10 (03): : 432 - 433
  • [38] FMCH: a new model for human-like postural control in walking
    Sharbafi, Maziar A.
    Seyfarth, Andre
    2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2015, : 5742 - 5747
  • [39] RealBehavior: A Framework for Faithfully Characterizing Foundation Models' Human-like Behavior Mechanisms
    Zhou, Enyu
    Zheng, Rui
    Xi, Zhiheng
    Gao, Songyang
    Fan, Xiaoran
    Fei, Zichu
    Ye, Jingting
    Gui, Tao
    Zhang, Qi
    Huang, Xuanjing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 10262 - 10274
  • [40] Video Denoising for Scenes With Challenging Motion: A Comprehensive Analysis and a New Framework
    Chen, Huaian
    Wang, Jianfeng
    Duan, Minghui
    Jin, Yi
    Kan, Yan
    Zhu, Changan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5704 - 5719