Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited by: 0
Authors
Xu, Fangzhi [1 ]
Lin, Qika [1 ]
Han, Jiawei [1 ]
Zhao, Tianzhe [1 ]
Liu, Jun [2 ]
Cambria, Erik [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China
[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning;
DOI
10.1109/TKDE.2025.3536008
CLC classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Logical reasoning consistently plays a fundamental and significant role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively address logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains an open question. To bridge this gap, we provide comprehensive evaluations in this paper. First, to make the evaluations systematic, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. For comprehensiveness, we include three representative early-era LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-level evaluations in both objective and subjective manners, covering answers as well as explanations: answer correctness, explain correctness, explain completeness and explain redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases are attributed to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and benchmark the logical reasoning capability of LLMs in isolation, we propose a new dataset with neutral content. Based on these in-depth evaluations, the paper finally forms a general evaluation scheme of logical reasoning capability along six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future work.
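
The fine-level evaluation described in the abstract can be pictured with a short sketch. The Python snippet below is an illustrative assumption, not the authors' released code: the names EvalRecord, ReasoningSetting and aggregate are hypothetical. It only shows how per-example judgments on the four answer/explanation dimensions might be grouped by reasoning setting and averaged.

# Illustrative sketch only: class and field names are assumptions, not the paper's code.
# It aggregates per-example judgments on the four fine-level dimensions per reasoning setting.
from dataclasses import dataclass
from enum import Enum
from collections import defaultdict
from typing import Iterable

class ReasoningSetting(Enum):
    DEDUCTIVE = "deductive"
    INDUCTIVE = "inductive"
    ABDUCTIVE = "abductive"
    MIXED = "mixed"

@dataclass
class EvalRecord:
    """One evaluated (question, model answer, model explanation) triple."""
    setting: ReasoningSetting
    answer_correct: bool      # objective: final answer matches the gold label
    explain_correct: bool     # subjective: explanation is logically sound
    explain_complete: bool    # subjective: explanation covers all required evidence
    explain_redundant: bool   # subjective: explanation contains unnecessary steps

def aggregate(records: Iterable[EvalRecord]) -> dict[ReasoningSetting, dict[str, float]]:
    """Average each fine-level dimension within every reasoning setting."""
    grouped: dict[ReasoningSetting, list[EvalRecord]] = defaultdict(list)
    for r in records:
        grouped[r.setting].append(r)
    report = {}
    for setting, group in grouped.items():
        n = len(group)
        report[setting] = {
            "answer_correctness": sum(r.answer_correct for r in group) / n,
            "explain_correctness": sum(r.explain_correct for r in group) / n,
            "explain_completeness": sum(r.explain_complete for r in group) / n,
            "explain_redundancy": sum(r.explain_redundant for r in group) / n,
        }
    return report

if __name__ == "__main__":
    demo = [
        EvalRecord(ReasoningSetting.DEDUCTIVE, True, True, True, False),
        EvalRecord(ReasoningSetting.DEDUCTIVE, True, False, False, True),
        EvalRecord(ReasoningSetting.ABDUCTIVE, False, False, False, False),
    ]
    for setting, scores in aggregate(demo).items():
        print(setting.value, scores)
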
Pages: 1620-1634
Page count: 15
Related papers
50 in total
  • [22] A comprehensive evaluation of large language models in mining gene relations and pathway knowledge
    Azam, Muhammad
    Chen, Yibo
    Arowolo, Micheal Olaolu
    Liu, Haowang
    Popescu, Mihail
    Xu, Dong
    QUANTITATIVE BIOLOGY, 2024, 12 (04) : 360 - 374
  • [23] Improving Large Language Models in Event Relation Logical Prediction
    Chen, Meiqi
    Ma, Yubo
    Song, Kaitao
    Cao, Yixin
    Zhang, Yan
    Li, Dongsheng
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 9451 - 9478
  • [24] Large Language Models on Graphs: A Comprehensive Survey
    Jin, Bowen
    Liu, Gang
    Han, Chi
    Jiang, Meng
    Ji, Heng
    Han, Jiawei
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (12) : 8622 - 8642
  • [25] LLMBox: A Comprehensive Library for Large Language Models
    Tang, Tianyi
    Hui, Yiwen
    Li, Bingqian
    Lu, Wenyang
    Qin, Zijing
    Sun, Haoxiang
    Wang, Jiapeng
    Xu, Shiyi
    Cheng, Xiaoxue
    Guo, Geyang
    Peng, Han
    Zheng, Bowen
    Tang, Yiru
    Min, Yingqian
    Chen, Yushuo
    Chen, Jie
    Zhao, Yuanqian
    Ding, Luran
    Wang, Yuhao
    Dong, Zican
    Xia, Chunxuan
    Li, Junyi
    Zhou, Kun
    Zhao, Wayne Xin
    Wen, Ji-Rong
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024, : 388 - 399
  • [26] Large Language Models: A Comprehensive Guide for Radiologists
    Kim, Sunkyu
    Lee, Choong-kun
    Kim, Seung-seob
    JOURNAL OF THE KOREAN SOCIETY OF RADIOLOGY, 2024, 85 (05): : 861 - 882
  • [27] Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment
    Kjell, Oscar N. E.
    Kjell, Katarina
    Schwartz, H. Andrew
    PSYCHIATRY RESEARCH, 2024, 333
  • [28] LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
    Xu, Peng
    Shao, Wenqi
    Zhang, Kaipeng
    Gao, Peng
    Liu, Shuo
    Lei, Meng
    Meng, Fanqing
    Huang, Siyuan
    Qiao, Yu
    Luo, Ping
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (03) : 1877 - 1893
  • [29] What do Users Really Ask Large Language Models?
    Trippas, Johanne R.
    Al Lawati, Sara Fahad Dawood
    Mackenzie, Joel
    Gallagher, Luke
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2703 - 2707
  • [30] Will Large Language Models Really Change How Work Is Done?
    Cappelli, Peter
    Tambe, Prasanna
    Yakubovich, Valery
    MIT SLOAN MANAGEMENT REVIEW, 2024, 65 (03)