Causal Dataset Discovery with Large Language Models

Cited by: 0

Authors:

Liu, Junfei [1]

Sun, Shaotong [1]

Nargesian, Fatemeh [1]

Affiliations:

[1] Univ Rochester, 601 Elmwood Ave, Rochester, NY 14627 USA

Source:

WORKSHOP ON HUMAN-IN-THE-LOOP DATA ANALYTICS, HILDA 2024 | 2024

Keywords:

SEARCH;

DOI:

10.1145/3665939.3665968

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104; 0812; 0835; 1405;

Abstract:

Causal data discovery is crucial in scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets that contain columns with causal relationships to columns in a query dataset. Discovering causal links in large-scale repositories faces three major challenges: the vast scale of the data, the inherent sparsity of causal links, and the incompleteness of the variables present. Identifying causal relationships among datasets is complex and time-intensive, especially because it requires joining datasets to bring all variables together before applying causal link discovery. In this paper, we introduce the Causal Dataset Discovery problem and propose a large language model (LLM)-based framework to discover potential pairwise causal links between columns from different datasets. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and we prevent the extreme imbalance in causal candidate distributions that arises from the natural sparsity of causal connections. We create benchmarks specific to this task and experimentally show that our framework achieves remarkable performance with GPT-3.5 and GPT-4. We summarize the distinctive behaviors of different LLM strategies and discuss improvements for future research.

Pages: 8