IterClean: An Iterative Data Cleaning Framework with Large Language Models

Authors
Ni, Wei [1, 3]
Zhang, Kaihang [1]
Miao, Xiaoye [1, 2]
Zhao, Xiangyu [3]
Wu, Yangyang [4]
Yin, Jianwei [5]
Affiliations
[1] Zhejiang Univ, Ctr Data Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, State Key Lab Brain Machine Intelligence, Hangzhou, Peoples R China
[3] City Univ Hong Kong, Sch Data Sci, Hong Kong, Peoples R China
[4] Zhejiang Univ, Software Coll, Ningbo, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci, Hangzhou, Peoples R China
Keywords
Data cleaning; error detection; error repair; large language models; REPAIRS
DOI
10.1145/3674399.3674436
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the era of generative artificial intelligence, data accuracy is paramount. Erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods employ a sequential detect-repair paradigm, leaving over half of the errors unresolved in real scenarios. We introduce IterClean, an iterative data cleaning framework that leverages large language models (LLMs). The framework employs a two-step process: data labeling and iterative data cleaning. With only a small amount of labeled data, IterClean runs an iterative cleaning process involving an error detector, an error verifier, and an error repairer to significantly enhance cleaning performance. Extensive experiments across four datasets demonstrate that IterClean achieves an F1 score up to three times higher than the best state-of-the-art approaches while requiring only 5 labeled tuples.
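The abstract names three components (an error detector, an error verifier, and an error repairer) that run in an iterative loop. The sketch below is only one plausible reading of that loop, not the authors' implementation: the function `iter_clean`, its signature, the `max_rounds` cap, and the three toy components are all hypothetical, and simple rules stand in for the LLM calls. The data labeling step is not modeled.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, str]
Cell = Tuple[int, str]  # (row index, column name)


def iter_clean(
    rows: List[Row],
    detect: Callable[[List[Row]], Set[Cell]],
    verify: Callable[[List[Row], Cell], bool],
    repair: Callable[[List[Row], Cell], str],
    max_rounds: int = 5,  # assumed stopping cap, not taken from the paper
) -> List[Row]:
    """Repeat detect -> verify -> repair until no verified error remains."""
    cleaned = [dict(r) for r in rows]  # work on a copy of the input
    for _ in range(max_rounds):
        suspects = detect(cleaned)
        confirmed = [c for c in sorted(suspects) if verify(cleaned, c)]
        if not confirmed:
            break  # nothing left that the verifier accepts as an error
        for i, col in confirmed:
            cleaned[i][col] = repair(cleaned, (i, col))
    return cleaned


# Hypothetical rule-based stand-ins for the LLM-backed components.
KNOWN_CITIES = {"Hangzhou", "Ningbo"}


def toy_detect(rows: List[Row]) -> Set[Cell]:
    # Flag every city value that is not in the known vocabulary.
    return {(i, "city") for i, r in enumerate(rows) if r["city"] not in KNOWN_CITIES}


def toy_verify(rows: List[Row], cell: Cell) -> bool:
    # Accept a suspect only if the cell is non-empty.
    i, col = cell
    return rows[i][col].strip() != ""


def toy_repair(rows: List[Row], cell: Cell) -> str:
    # Replace the value with a known city sharing its first letter, if any.
    i, col = cell
    value = rows[i][col]
    for city in sorted(KNOWN_CITIES):
        if city[0].lower() == value[0].lower():
            return city
    return value  # no candidate found: leave the value unchanged


dirty = [{"city": "Hangzhuo"}, {"city": "Ningbo"}, {"city": "ningbo "}]
print(iter_clean(dirty, toy_detect, toy_verify, toy_repair))
```

In this toy run the first round flags and repairs two cells, and the second round finds nothing to flag, so the loop stops early. Swapping the three callables for model-backed functions would keep the control flow the same.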
Pages: 100-105
Number of pages: 6