Causal Dataset Discovery with Large Language Models

Authors
Liu, Junfei [1]
Sun, Shaotong [1]
Nargesian, Fatemeh [1]
Affiliations
[1] Univ Rochester, 601 Elmwood Ave, Rochester, NY 14627 USA
Keywords
SEARCH
DOI
10.1145/3665939.3665968
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Causal data discovery is crucial in scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets that contain columns that have causal relationships with columns in a query dataset. Discovering causal links from large-scale repositories faces three major challenges: the vast scale of the data, the inherent sparsity of causal links, and the incompleteness of the variables present. Identifying causal relationships among datasets is a complex and time-intensive task, especially because it requires joining datasets, to bring all variables together, before applying causal link discovery. In this paper, we introduce the Causal Dataset Discovery problem and propose a large language model (LLM)-based framework to discover potential pairwise causal links between columns from different datasets. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and we prevent the extreme imbalance in causal candidate distributions that arises from the natural sparsity of causal connections. We create benchmarks specific to this task and experimentally show that our framework achieves remarkable performance with GPT-3.5 and GPT-4. We summarize the distinctive behaviors of different LLM strategies and discuss improvements for future research.
Pages: 8