Training Data Selection for Record Linkage Classification

Authors
Ali Omar, Zaturrawiah [1]
Zamzuri, Zamira Hasanah [2]
Ariff, Noratiqah Mohd [2]
Abu Bakar, Mohd Aftar [2]
Affiliations
[1] Univ Malaysia Sabah, Fac Sci & Nat Resources, Math Comp Graph Programme, Jalan UMS, Kota Kinabalu 88400, Malaysia
[2] Univ Kebangsaan Malaysia, Fac Sci & Technol, Dept Math Sci, Bangi 43600, Malaysia
Source
SYMMETRY-BASEL | 2023 / Volume 15 / Issue 05
Keywords
record linkage; unsupervised random forest; similarity measure; training data
DOI
10.3390/sym15051060
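Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification code
07 ; 0710 ; 09
Abstract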
This paper presents a new two-step approach to record linkage whose first step focuses on creating high-quality training data. The approach uses an unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed for selecting non-match pairs for the training data, and each was tested with both balanced (symmetric) and imbalanced (asymmetric) class distributions. The top and imbalanced construction was the most effective, producing training data with 100% correct labels. Random forest and support vector machine classifiers were compared; random forest with the top and imbalanced construction produced an F1-score comparable to that of probabilistic record linkage using the expectation maximisation algorithm and of EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F1-score by 1% and recall by 6.45% compared with existing record linkage methods. By emphasising the creation of high-quality training data, this approach has the potential to improve the accuracy and efficiency of record linkage across a wide range of applications.
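The training-data selection step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the three constructions, so the rule below is an assumption made only for illustration: the highest-scoring pairs are taken as match seeds and the lowest-scoring pairs as non-match seeds. The function name `select_seed_pairs` and the parameters `n_match` and `nonmatch_ratio` are hypothetical and do not come from the paper; the similarity scores are assumed to have been computed already by some similarity measure.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (record id in file A, record id in file B)


def select_seed_pairs(
    scores: Dict[Pair, float],
    n_match: int,
    nonmatch_ratio: float = 1.0,
) -> Tuple[List[Pair], List[Pair]]:
    """Pick seed training pairs from similarity scores (illustrative rule only).

    The n_match highest-scoring pairs become match seeds and the
    lowest-scoring pairs become non-match seeds. A nonmatch_ratio of 1.0
    gives a balanced set; a ratio above 1.0 gives an imbalanced set with
    more non-matches than matches.
    """
    # Rank candidate pairs from most similar to least similar.
    ranked = sorted(scores, key=scores.get, reverse=True)
    matches = ranked[:n_match]

    # Never let the non-match seeds overlap with the match seeds.
    wanted = int(round(n_match * nonmatch_ratio))
    n_nonmatch = min(wanted, len(ranked) - len(matches))
    nonmatches = ranked[len(ranked) - n_nonmatch:]
    return matches, nonmatches


if __name__ == "__main__":
    # Toy similarity scores for six candidate pairs (made-up values).
    toy_scores = {
        ("a1", "b1"): 0.98,
        ("a2", "b2"): 0.95,
        ("a1", "b2"): 0.40,
        ("a2", "b1"): 0.10,
        ("a3", "b1"): 0.05,
        ("a3", "b2"): 0.02,
    }
    balanced = select_seed_pairs(toy_scores, n_match=2, nonmatch_ratio=1.0)
    imbalanced = select_seed_pairs(toy_scores, n_match=2, nonmatch_ratio=2.0)
    print("balanced:", balanced)
    print("imbalanced:", imbalanced)
```

In a two-step pipeline of the kind the abstract describes, the returned seed pairs would then serve as labelled training data for a supervised classifier such as a random forest or support vector machine.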
Pages: 17