Causal Dataset Discovery with Large Language Models

Authors
Liu, Junfei [1]
Sun, Shaotong [1]
Nargesian, Fatemeh [1]
Affiliations
[1] Univ Rochester, 601 Elmwood Ave, Rochester, NY 14627 USA
Keywords
SEARCH
DOI
10.1145/3665939.3665968
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Causal data discovery is crucial to scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets that contain columns that have causal relationships with columns in a query dataset. Discovering causal links in large-scale repositories faces three major challenges: the vast scale of the data, the inherent sparsity of causal links, and the incompleteness of the variables present. Identifying causal relationships among datasets is a complex and time-intensive task, especially because it requires joining datasets to bring all variables together before applying causal link discovery. In this paper, we introduce the Causal Dataset Discovery problem and propose a large language model (LLM)-based framework to discover potential pairwise causal links between columns from different datasets. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and we prevent the extreme imbalance in causal candidate distributions that arises from the natural sparsity of causal connections. We create benchmarks specific to this task and experimentally show that our framework achieves remarkable performance with GPT-3.5 and GPT-4. We summarize the distinctive behaviors of different LLM strategies and discuss improvements for future research.
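To make the task definition in the abstract concrete, the following is a minimal sketch of pairwise candidate generation across datasets, assuming that only column names are shown to the model. It is not the authors' framework: the `Schema` type, the prompt wording, the function names, and the `ask` callable (a stand-in for whatever LLM client is used) are all hypothetical, and no fine-tuning or candidate balancing is modeled.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical schema representation: dataset name -> list of column names.
Schema = Dict[str, List[str]]
# (query_dataset, query_column, repo_dataset, repo_column)
Pair = Tuple[str, str, str, str]


def candidate_pairs(query: Schema, repository: Schema) -> List[Pair]:
    """Enumerate every cross-dataset column pair; no join is performed."""
    pairs: List[Pair] = []
    for (q_name, q_cols), (r_name, r_cols) in product(query.items(), repository.items()):
        if q_name == r_name:
            continue  # only pairs that span two different datasets
        for q_col, r_col in product(q_cols, r_cols):
            pairs.append((q_name, q_col, r_name, r_col))
    return pairs


def build_prompt(q_name: str, q_col: str, r_name: str, r_col: str) -> str:
    """Illustrative yes/no prompt; the wording is an assumption, not the paper's."""
    return (
        f"Column '{q_col}' comes from dataset '{q_name}'. "
        f"Column '{r_col}' comes from dataset '{r_name}'. "
        f"Does a change in '{q_col}' plausibly cause a change in '{r_col}'? "
        "Answer Yes or No."
    )


def discover_links(query: Schema, repository: Schema,
                   ask: Callable[[str], str]) -> List[Pair]:
    """Keep the pairs for which the supplied `ask` callable answers 'yes'."""
    links: List[Pair] = []
    for pair in candidate_pairs(query, repository):
        answer = ask(build_prompt(*pair)).strip().lower()
        if answer.startswith("yes"):
            links.append(pair)
    return links


# Toy schemas and a keyword-based stand-in for an LLM call (assumptions).
example_query: Schema = {"weather": ["rainfall_mm", "temperature_c"]}
example_repo: Schema = {"crops": ["yield_tons", "farm_id"]}


def fake_ask(prompt: str) -> str:
    return "Yes" if "rainfall_mm" in prompt and "yield_tons" in prompt else "No"


example_links = discover_links(example_query, example_repo, fake_ask)
```

Because the number of candidates grows with the product of the column counts while true causal links are rare, most answers in a real repository would be negative, which is the imbalance the abstract refers to.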
Pages: 8