IterClean: An Iterative Data Cleaning Framework with Large Language Models

Cited by: 0
Authors
Ni, Wei [1 ,3 ]
Zhang, Kaihang [1 ]
Miao, Xiaoye [1 ,2 ]
Zhao, Xiangyu [3 ]
Wu, Yangyang [4 ]
Yin, Jianwei [5 ]
Affiliations
[1] Zhejiang Univ, Ctr Data Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, State Key Lab Brain Machine Intelligence, Hangzhou, Peoples R China
[3] City Univ Hong Kong, Sch Data Sci, Hong Kong, Peoples R China
[4] Zhejiang Univ, Software Coll, Ningbo, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci, Hangzhou, Peoples R China
Keywords
Data cleaning; error detection; error repair; large language models; REPAIRS
DOI
10.1145/3674399.3674436
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the era of generative artificial intelligence, data accuracy is paramount: erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods follow a sequential detect-repair paradigm, leaving over half of the errors unresolved in real-world scenarios. We introduce IterClean, an iterative data cleaning framework leveraging large language models (LLMs). The framework employs a two-step process: data labeling followed by iterative data cleaning. Starting from only a few labeled tuples, IterClean repeatedly applies an error detector, an error verifier, and an error repairer to significantly improve cleaning performance. Extensive experiments on four datasets demonstrate that IterClean achieves an F1 score up to three times higher than the best state-of-the-art approaches while requiring only five labeled tuples.
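The abstract describes a detect-verify-repair loop seeded with a handful of labeled tuples. The sketch below illustrates one plausible reading of that control flow in Python; it is not the authors' implementation, and the `llm` callable, the prompt wording, and the early-stopping rule are all assumptions introduced here for illustration.

```python
# Minimal sketch of an iterative detect-verify-repair loop in the spirit of
# IterClean. All prompts and the `llm` callable are hypothetical placeholders,
# not the paper's actual method.
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]  # one relational tuple as attribute -> value

def iter_clean(
    table: List[Row],
    labeled: List[Tuple[Row, Row]],   # a few (dirty, clean) seed pairs
    llm: Callable[[str], str],        # any text-in/text-out LLM
    max_iters: int = 3,
) -> List[Row]:
    """Iteratively detect, verify, and repair cell-level errors."""
    for _ in range(max_iters):
        changed = False
        for row in table:
            for attr, value in row.items():
                # 1) Error detector: flag suspicious cells, guided by the seeds.
                detect = llm(
                    f"Seed examples: {labeled}\n"
                    f"Is the value '{value}' for attribute '{attr}' "
                    f"in row {row} erroneous? Answer yes/no."
                )
                if "yes" not in detect.lower():
                    continue
                # 2) Error verifier: double-check the flag to cut false positives.
                verify = llm(
                    f"Verify: is '{value}' truly an error in {row}? Answer yes/no."
                )
                if "yes" not in verify.lower():
                    continue
                # 3) Error repairer: propose a corrected value for the cell.
                repair = llm(
                    f"Propose a corrected value for '{attr}' in {row}."
                ).strip()
                if repair and repair != value:
                    row[attr] = repair
                    changed = True
        if not changed:  # stop early once a full pass makes no repairs
            break
    return table
```

Under these assumptions, the verifier acts as a filter between detection and repair, which is one way the iterative scheme could avoid "repairing" cells that were never wrong.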
Pages: 100-105
Page count: 6