IterClean: An Iterative Data Cleaning Framework with Large Language Models

Authors
Ni, Wei [1, 3]
Zhang, Kaihang [1]
Miao, Xiaoye [1, 2]
Zhao, Xiangyu [3 ]
Wu, Yangyang [4 ]
Yin, Jianwei [5 ]
Affiliations
[1] Zhejiang Univ, Ctr Data Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, State Key Lab Brain Machine Intelligence, Hangzhou, Peoples R China
[3] City Univ Hong Kong, Sch Data Sci, Hong Kong, Peoples R China
[4] Zhejiang Univ, Software Coll, Ningbo, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci, Hangzhou, Peoples R China
Keywords
Data cleaning; error detection; error repair; large language models; repairs
DOI
10.1145/3674399.3674436
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the era of generative artificial intelligence, data accuracy is paramount. Erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods follow a sequential detect-then-repair paradigm, which leaves over half of the errors unresolved in real scenarios. We introduce IterClean, an iterative data cleaning framework that leverages large language models (LLMs). The framework works in two steps: data labeling and iterative data cleaning. With only a few labeled tuples, IterClean runs an iterative cleaning process involving an error detector, an error verifier, and an error repairer to significantly enhance cleaning performance. Extensive experiments across four datasets demonstrate that IterClean, while requiring only 5 labeled tuples, achieves an F1 score up to three times higher than that of the best state-of-the-art approaches.
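The detector, verifier, and repairer loop named in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not specify prompts, stopping rules, or data structures, so everything below is an assumption. The function `iterative_clean`, the `max_rounds` cap, the `KNOWN_STATES` set (standing in for knowledge drawn from a few labeled tuples), and the three rule-based components (standing in for LLM calls) are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, str]
Cell = Tuple[int, str]  # (row index, column name)


def iterative_clean(
    table: List[Row],
    detect: Callable[[List[Row]], Set[Cell]],
    verify: Callable[[List[Row], Cell], bool],
    repair: Callable[[List[Row], Cell], str],
    max_rounds: int = 5,  # assumed safety cap, not taken from the paper
) -> List[Row]:
    """Repeat detect -> verify -> repair until a round changes nothing."""
    cleaned = [dict(row) for row in table]  # leave the input untouched
    for _ in range(max_rounds):
        suspects = detect(cleaned)
        confirmed = [c for c in sorted(suspects) if verify(cleaned, c)]
        changed = False
        for i, col in confirmed:
            new_value = repair(cleaned, (i, col))
            if new_value != cleaned[i][col]:
                cleaned[i][col] = new_value
                changed = True
        if not changed:
            break
    return cleaned


# Hypothetical rule-based stand-ins for the LLM-backed components.
KNOWN_STATES = {"ZJ", "HK"}  # assumed to come from a few labeled tuples


def detect(table: List[Row]) -> Set[Cell]:
    # Over-eager detector: flag every value that differs from the majority.
    majority = Counter(r["state"] for r in table).most_common(1)[0][0]
    return {(i, "state") for i, r in enumerate(table) if r["state"] != majority}


def verify(table: List[Row], cell: Cell) -> bool:
    # Verifier: drop false positives that are in fact valid values.
    i, col = cell
    return table[i][col] not in KNOWN_STATES


def repair(table: List[Row], cell: Cell) -> str:
    # Repairer: normalize formatting; keep the value if still unknown.
    i, col = cell
    candidate = table[i][col].strip().upper()
    return candidate if candidate in KNOWN_STATES else table[i][col]


table = [
    {"city": "Hangzhou", "state": "ZJ"},
    {"city": "Ningbo", "state": " zj"},
    {"city": "Hong Kong", "state": "HK"},
    {"city": "Hangzhou", "state": "ZJ"},
]
result = iterative_clean(table, detect, verify, repair)
print([row["state"] for row in result])
```

In this toy run the detector flags both " zj" and "HK", the verifier rejects "HK" as a false positive, and the repairer normalizes " zj"; a second round finds nothing left to confirm, so the loop stops.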
Pages: 100-105
Number of pages: 6