Causal Dataset Discovery with Large Language Models

Cited by: 0

Authors:

Liu, Junfei [1]

Sun, Shaotong [1]

Nargesian, Fatemeh [1]

Affiliation:

[1] Univ Rochester, 601 Elmwood Ave, Rochester, NY 14627 USA

Source:

WORKSHOP ON HUMAN-IN-THE-LOOP DATA ANALYTICS, HILDA 2024 | 2024

Keywords:

SEARCH;

DOI:

10.1145/3665939.3665968

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Causal data discovery is crucial to scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets whose columns have causal relationships with columns in a query dataset. Discovering causal links in large-scale repositories faces three major challenges: the vast scale of the data, the inherent sparsity of causal links, and the incompleteness of the variables present. Identifying causal relationships among datasets is a complex and time-intensive task, especially because it requires joining datasets to bring all variables together before applying causal link discovery. In this paper, we introduce the Causal Dataset Discovery problem and propose a large language model (LLM)-based framework to discover potential pairwise causal links between columns from different datasets. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and we prevent the extreme imbalance in causal candidate distributions that arises from the natural sparsity of causal connections. We create benchmarks specific to this task and experimentally show that our framework achieves remarkable performance with GPT-3.5 and GPT-4. We summarize the distinctive behaviors of different LLM strategies and discuss improvements for future research.
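
As a rough illustration of the task the abstract describes, the sketch below enumerates cross-dataset column pairs for a query dataset and builds a yes/no prompt for each pair. The function names, the prompt wording, and the sample column names are all hypothetical and are not taken from the paper; the LLM call itself is deliberately left out.

```python
from itertools import product


def candidate_pairs(query_columns, repository):
    """Yield (query_column, dataset_name, candidate_column) triples.

    `repository` maps a dataset name to its list of column names.
    """
    for dataset_name, columns in repository.items():
        for query_col, cand_col in product(query_columns, columns):
            yield query_col, dataset_name, cand_col


def build_prompt(cause, effect):
    """Hypothetical prompt wording; the paper's actual prompts are not shown here."""
    return (
        f"Does a change in '{cause}' plausibly cause a change in '{effect}'? "
        "Answer yes or no."
    )


# Illustrative column names only (not from the paper's benchmarks).
query_columns = ["smoking_rate", "median_income"]
repository = {
    "health_stats": ["lung_cancer_incidence", "life_expectancy"],
    "weather": ["annual_rainfall"],
}

pairs = list(candidate_pairs(query_columns, repository))
prompts = [build_prompt(q, c) for q, _, c in pairs]
print(len(pairs), "candidate pairs")
```

The number of candidate pairs grows with the product of column counts, while only a few pairs are expected to be causal. This is the scale and sparsity concern the abstract raises; the sketch makes no attempt to address it.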


Pages: 8
