A Method for Duplicate Record Detection Based on Decision Tree

Cited by: 0
Authors
Lin, Guangyan [1]
Qian, Yuxiang [1]
Zhang, Yiqiong [1]
Affiliations
[1] Beihang Univ, Sch Software, Beijing, Peoples R China
Keywords
Duplicate Detection; Decision Tree; Data Cleaning; Attribute Similarity; LINKAGE;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Duplicate records are a common problem that affects many information systems. Computing the similarity of two records by comparing their attributes one by one is time-consuming and complex. This paper proposes a duplicate detection method based on a decision tree. First, attribute similarity algorithms for common data types are summarized. Then, by mapping attribute similarities to decision tree nodes, whether two records are duplicates can be determined early, without computing the similarity of every attribute. This reduces the time complexity significantly while maintaining precision. In the experiments, the method achieves a precision above 98% and an F-score of 97%.
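The idea of mapping attribute similarities to decision tree nodes can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no tree structure, thresholds, or similarity functions, so `Node`, `classify`, `string_similarity`, `numeric_similarity`, the attribute names, and the threshold values below are all hypothetical. The sketch only shows the early-exit behavior, where a similarity is computed for an attribute only when the traversal reaches a node that tests it.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Union


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (illustrative choice, not from the paper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """Relative closeness of two numbers in [0, 1] (illustrative choice)."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))


@dataclass
class Node:
    """Internal node: tests one attribute's similarity against a threshold."""
    attribute: str
    similarity: Callable[[object, object], float]
    threshold: float
    low: Union["Node", bool]   # branch taken when similarity < threshold
    high: Union["Node", bool]  # branch taken when similarity >= threshold


def classify(tree, rec_a, rec_b):
    """Walk the tree, computing similarities lazily.

    Returns (is_duplicate, attributes_actually_compared).
    """
    compared = []
    node = tree
    while not isinstance(node, bool):  # a bool is a leaf: duplicate or not
        sim = node.similarity(rec_a[node.attribute], rec_b[node.attribute])
        compared.append(node.attribute)
        node = node.high if sim >= node.threshold else node.low
    return node, compared


# Hand-built tree with assumed thresholds; a real tree would be learned from labeled pairs.
tree = Node("name", string_similarity, 0.8,
            low=False,
            high=Node("birth_year", numeric_similarity, 0.99, low=False, high=True))

a = {"name": "Jane Smith", "birth_year": 1980, "address": "1 Main St"}
b = {"name": "jane smith", "birth_year": 1980, "address": "1 Main Street"}
c = {"name": "Robert Brown", "birth_year": 1980, "address": "1 Main St"}

print(classify(tree, a, b))  # names match, so birth_year is also checked
print(classify(tree, a, c))  # names differ, so the traversal stops after one comparison
```

In this sketch the `address` attribute is never compared, and the pair `(a, c)` is rejected after a single similarity computation, which is the kind of saving the abstract describes.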
Pages: 146-150
Page count: 5