IterClean: An Iterative Data Cleaning Framework with Large Language Models

Cited by: 0
Authors
Ni, Wei [1,3]
Zhang, Kaihang [1]
Miao, Xiaoye [1,2]
Zhao, Xiangyu [3]
Wu, Yangyang [4]
Yin, Jianwei [5]
Affiliations
[1] Zhejiang Univ, Ctr Data Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, State Key Lab Brain Machine Intelligence, Hangzhou, Peoples R China
[3] City Univ Hong Kong, Sch Data Sci, Hong Kong, Peoples R China
[4] Zhejiang Univ, Software Coll, Ningbo, Peoples R China
[5] Zhejiang Univ, Coll Comp Sci, Hangzhou, Peoples R China
Keywords
Data cleaning; error detection; error repair; large language models; repairs
DOI
10.1145/3674399.3674436
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In the era of generative artificial intelligence, data accuracy is paramount. Erroneous data often leads to faulty outcomes and economic losses. Previous cleaning methods follow a sequential detect-repair paradigm, leaving over half of the errors unresolved in real-world scenarios. We introduce IterClean, an iterative data cleaning framework that leverages large language models (LLMs). The framework operates in two steps: data labeling and iterative data cleaning. Starting from only a few labeled tuples, IterClean applies an iterative cleaning process involving an error detector, an error verifier, and an error repairer to significantly enhance cleaning performance. Extensive experiments across four datasets demonstrate that IterClean achieves an F1 score up to three times higher than the best state-of-the-art approaches while requiring only five labeled tuples.
Pages: 100-105
Number of pages: 6
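
The abstract above describes a detect-verify-repair loop seeded with a handful of labeled tuples. Below is a minimal Python sketch of what such a loop could look like; the llm callable, the prompt wording, and the convergence check are illustrative assumptions for this record, not the authors' implementation.

    from typing import Callable

    # Minimal sketch of an IterClean-style detect / verify / repair loop.
    # Everything here (function names, prompts, convergence test) is an
    # illustrative assumption, not the paper's actual implementation.

    def iterative_clean(
        rows: list[dict],
        seeds: list[tuple[dict, dict]],     # few (dirty, clean) labeled pairs
        llm: Callable[[str], str],          # any prompt -> completion function
        max_iters: int = 3,
    ) -> list[dict]:
        """Repeatedly detect, verify, and repair cell-level errors."""
        few_shot = "\n".join(f"dirty: {d} -> clean: {c}" for d, c in seeds)

        def yes(prompt: str) -> bool:
            return llm(prompt).strip().lower().startswith("yes")

        for _ in range(max_iters):
            changed = False
            for row in rows:
                for col, val in list(row.items()):
                    # Step 1 -- error detector: flag suspicious cells.
                    if not yes(f"{few_shot}\nIn {row}, is the value '{val}' "
                               f"for column '{col}' erroneous? Answer yes or no."):
                        continue
                    # Step 2 -- error verifier: second pass drops false positives.
                    if not yes(f"Re-examine {row}. Is column '{col}' really "
                               f"wrong? Answer yes or no."):
                        continue
                    # Step 3 -- error repairer: propose a corrected value.
                    fix = llm(f"{few_shot}\nGiven {row}, output only the "
                              f"corrected value for column '{col}'.").strip()
                    if fix != val:
                        row[col] = fix
                        changed = True
            if not changed:     # converged: no cell changed in this pass
                break
        return rows

A production loop would likely batch prompts and parse structured LLM outputs, but this shape captures the detector, verifier, and repairer iteration that the abstract names, along with the few-labeled-tuple seeding.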