IterClean: An Iterative Data Cleaning Framework with Large Language Models

Cited by: 0
Authors
Ni, Wei [1 ,3 ]
Zhang, Kaihang [1 ]
Miao, Xiaoye [1 ,2 ]
Zhao, Xiangyu [3 ]
Wu, Yangyang [4 ]
Yin, Jianwei [5 ]
Affiliations
[1] Zhejiang Univ, Ctr Data Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, State Key Lab Brain Machine Intelligence, Hangzhou, Peoples R China
[3] City Univ Hong Kong, Sch Data Sci, Hong Kong, Peoples R China
[4] Zhejiang Univ, Software Coll, Ningbo, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci, Hangzhou, Peoples R China
Keywords
Data cleaning; error detection; error repair; large language models; REPAIRS;
DOI
10.1145/3674399.3674436
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the era of generative artificial intelligence, the accuracy of data is paramount. Erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods employ a sequential detect-repair paradigm, leaving over half of the errors unresolved in real-world scenarios. We introduce IterClean, an iterative data cleaning framework leveraging large language models (LLMs). The framework follows a two-step process: data labeling and iterative data cleaning. With only a few labeled tuples, IterClean runs an iterative cleaning process involving an error detector, an error verifier, and an error repairer to significantly enhance cleaning performance. Extensive experiments across four datasets demonstrate that IterClean achieves an F1 score up to three times higher than that of the best state-of-the-art approaches while requiring only 5 labeled tuples.
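The detect, verify, repair loop named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record gives no algorithmic details, so the function names (`iterative_clean`, `detect`, `verify`, `repair`, `f1_score`), the toy two-column table, and the stopping rule are all assumptions, and simple rule-based stand-ins take the place of the LLM-backed components. The `f1_score` helper shows the standard cell-level F1 computation only as an example of how such a metric is usually defined.

```python
from collections import Counter

# Hypothetical stand-ins for the LLM-backed detector, verifier, and repairer.
# A cell is identified by (row index, column name).

def is_valid_state(value):
    # Assumed toy rule: a valid code is two uppercase letters.
    return len(value) == 2 and value.isalpha() and value.isupper()

def candidates(table, cell):
    # Valid values from other rows that share the same city.
    i, _ = cell
    city = table[i]["city"]
    return [r["state"] for j, r in enumerate(table)
            if j != i and r["city"] == city and is_valid_state(r["state"])]

def detect(table):
    # Flag every cell whose value breaks the toy rule.
    return {(i, "state") for i, row in enumerate(table)
            if not is_valid_state(row["state"])}

def verify(table, cell):
    # Confirm a suspect only if there is evidence to repair it from.
    return bool(candidates(table, cell))

def repair(table, cell):
    # Propose the most common valid value among rows with the same city.
    return Counter(candidates(table, cell)).most_common(1)[0][0]

def iterative_clean(table, detect, verify, repair, max_rounds=5):
    """Repeat detect -> verify -> repair until nothing more is confirmed."""
    table = [dict(row) for row in table]  # work on a copy
    for _ in range(max_rounds):
        confirmed = {c for c in detect(table) if verify(table, c)}
        if not confirmed:
            break
        changed = False
        for i, col in sorted(confirmed):
            new_value = repair(table, (i, col))
            if new_value != table[i][col]:
                table[i][col] = new_value
                changed = True
        if not changed:
            break
    return table

def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over sets of cells."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

dirty = [
    {"city": "Hangzhou", "state": "ZJ"},
    {"city": "Hangzhou", "state": "Zhejiang"},  # violates the toy rule
    {"city": "Ningbo", "state": ""},            # no evidence for a repair
    {"city": "Hangzhou", "state": "ZJ"},
]
cleaned = iterative_clean(dirty, detect, verify, repair)
```

In this toy run the second row is repaired from its neighbours, while the Ningbo row is flagged by the detector but left untouched because the verifier finds no supporting evidence. Separating verification from detection in this way is one plausible reading of why the abstract lists three components rather than two.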
Pages: 100-105
Number of pages: 6