Causal Dataset Discovery with Large Language Models

Cited by: 0
Authors
Liu, Junfei [1]
Sun, Shaotong [1]
Nargesian, Fatemeh [1]
Affiliations
[1] Univ Rochester, 601 Elmwood Ave, Rochester, NY 14627 USA
Keywords
SEARCH
DOI
10.1145/3665939.3665968
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Causal data discovery is crucial to scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets whose columns have causal relationships with columns in a query dataset. Discovering causal links in large-scale repositories faces three major challenges: the vast scale of the data, the inherent sparsity of causal links, and the incompleteness of the variables present. Identifying causal relationships among datasets is a complex and time-intensive task, especially because it requires joining datasets to bring all variables together before applying causal link discovery. In this paper, we introduce the Causal Dataset Discovery problem and propose a large language model (LLM)-based framework to discover potential pairwise causal links between columns from different datasets. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and we prevent the extreme imbalance in causal candidate distributions that arises from the natural sparsity of causal connections. We create benchmarks specific to this task and experimentally show that our framework achieves remarkable performance with GPT-3.5 and GPT-4. We summarize the distinctive behaviors of different LLM strategies and discuss improvements for future research.
Pages: 8