Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models' Understanding of Discourse Relations

Cited by: 0
Authors
Miao, Yisong [1 ]
Liu, Hongfu [1 ]
Lei, Wenqiang [2 ]
Chen, Nancy F. [3 ]
Kan, Min-Yen [1 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Sichuan Univ, Chengdu, Peoples R China
[3] Inst Infocomm Res, A*STAR, Singapore, Singapore
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
While large language models have significantly enhanced the effectiveness of discourse relation classification, it remains unclear whether their comprehension is faithful and reliable. We present DISQ, a new method for evaluating the faithfulness of discourse understanding, based on question answering. We first employ in-context learning to annotate the reasoning underlying discourse comprehension, based on the connections among key events within the discourse. Following this, DISQ interrogates the model with a sequence of questions to assess its grasp of core event relations, its resilience to counterfactual queries, and its consistency with its own previous responses. We then evaluate language models with different architectural designs using DISQ, finding that: (1) DISQ presents a significant challenge for all models, with the top-performing GPT model attaining only 41% of the ideal performance on PDTB; (2) DISQ is robust to domain shifts and paraphrase variations; (3) open-source models generally lag behind their closed-source GPT counterparts, with notable exceptions being those enhanced with chat or code/math capabilities; (4) our analysis validates the effectiveness of explicitly signalled discourse connectives, the role of contextual information, and the benefits of using historical QA data.
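
The abstract outlines a three-part questioning scheme: targeted questions on core event relations, counterfactual probes, and consistency checks against earlier answers. The Python sketch below is a minimal illustration of how such a QA loop could be scored; the model.generate interface, the prompt template, and the pre-annotated event_links input are assumptions for illustration, not the paper's actual implementation.

def ask_yes_no(model, prompt):
    """Query the model and normalise its reply to a boolean (yes -> True)."""
    reply = model.generate(prompt)  # assumed generate(str) -> str interface
    return reply.strip().lower().startswith("yes")

def disq_scores(model, context, event_links):
    """Probe one passage and return per-aspect accuracy.

    event_links: (head_event, relation, tail_event, wrong_relation) tuples
    derived from annotated discourse reasoning; wrong_relation is a
    distractor relation used for the counterfactual probe.
    """
    template = "Passage: {ctx}\nIs it true that '{a}' {rel} '{b}'? Answer yes or no."
    targeted_ok, counterfactual_ok, consistent_ok = [], [], []

    for a, rel, b, wrong_rel in event_links:
        first = ask_yes_no(model, template.format(ctx=context, a=a, rel=rel, b=b))
        targeted_ok.append(first)  # the true relation should be affirmed

        rejected = not ask_yes_no(model, template.format(ctx=context, a=a, rel=wrong_rel, b=b))
        counterfactual_ok.append(rejected)  # the distractor relation should be denied

        second = ask_yes_no(model, template.format(ctx=context, a=a, rel=rel, b=b))
        consistent_ok.append(first == second)  # re-asking should not flip the answer

    n = len(event_links) or 1
    return {
        "targeted": sum(targeted_ok) / n,
        "counterfactual": sum(counterfactual_ok) / n,
        "consistency": sum(consistent_ok) / n,
    }

# Example call (hypothetical inputs):
# disq_scores(model, passage,
#     [("the Fed cut rates", "caused", "markets rallied", "contrasted with")])

Note that consistency is probed here by simply re-asking the same question; the paper's actual protocol conditions on historical QA data, which the abstract reports as beneficial.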
Pages: 6277-6295
Number of pages: 19
Related Papers
16 records in total
  • [1] The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models
    Qi, Jingyuan
    Xu, Zhiyang
    Shen, Ying
    Liu, Minqian
    Jin, Di
    Wang, Qifan
    Huang, Lifu
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 4177-4199
  • [2] Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
    Chen, Yuyan
    Wu, Chenwei
    Yan, Songzhou
    Liu, Panjun
    Zhou, Haoyu
    Xiao, Yanghua
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 3138-3167
  • [3] Curriculum as a Discursive and Performative Space for Subjectivity and Learning: Understanding Immigrant Adolescents' Language Use in Classroom Discourse
    Qin, Kongji
    MODERN LANGUAGE JOURNAL, 2020, 104 (04): 842-859
  • [4] Labeling Explicit Discourse Relations Using Pre-trained Language Models
    Kurfali, Murathan
    TEXT, SPEECH, AND DIALOGUE (TSD 2020), 2020, 12284: 79-86
  • [5] Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding
    He, Mutian
    Garner, Philip N.
    INTERSPEECH 2023, 2023: 1109-1113
  • [6] ValueCSV: Evaluating Core Socialist Values Understanding in Large Language Models
    Xu, Yuemei
    Hu, Ling
    Qiu, Zihan
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT IV, NLPCC 2024, 2025, 15362: 346-358
  • [7] GRASP: A Novel Benchmark for Evaluating Language GRounding and Situated Physics Understanding in Multimodal Language Models
    Jassim, Serwan
    Holubar, Mario
    Richter, Annika
    Wolff, Cornelius
    Ohmer, Xenia
    Bruni, Elia
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024: 6297-6305
  • [8] DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation
    Doris, Anna C.
    Grandi, Daniele
    Tomich, Ryan
    Alam, Md Ferdous
    Ataei, Mohammadmehdi
    Cheong, Hyunmin
    Ahmed, Faez
    JOURNAL OF COMPUTING AND INFORMATION SCIENCE IN ENGINEERING, 2025, 25 (02)
  • [9] ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models
    Ren, Yuanyi
    Ye, Haoran
    Fang, Hanjun
    Zhang, Xin
    Song, Guojie
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 2015-2040
  • [10] Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors
    Kubis, Marek
    Skorzewski, Pawel
    Sowanski, Marcin
    Zietkiewicz, Tomasz
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 11824-11835