Training Data Selection for Record Linkage Classification

Cited by: 0
Authors
Ali Omar, Zaturrawiah [1]
Zamzuri, Zamira Hasanah [2]
Ariff, Noratiqah Mohd [2]
Abu Bakar, Mohd Aftar [2]
Affiliations
[1] Univ Malaysia Sabah, Fac Sci & Nat Resources, Math Comp Graph Programme, Jalan UMS, Kota Kinabalu 88400, Malaysia
[2] Univ Kebangsaan Malaysia, Fac Sci & Technol, Dept Math Sci, Bangi 43600, Malaysia
Source
SYMMETRY-BASEL | 2023 / Vol. 15 / Issue 05
Keywords
record linkage; unsupervised random forest; similarity measure; training data
DOI
10.3390/sym15051060
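Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification code
07; 0710; 09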
Abstract
This paper presents a new two-step approach for record linkage that focuses, in its first step, on creating high-quality training data. The approach uses an unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions are proposed for selecting non-match pairs for the training data, and each is tested with both balanced (symmetric) and imbalanced (asymmetric) class distributions. The top and imbalanced construction was the most effective, producing training data with 100% correct labels. Random forest and support vector machine classifiers were compared; random forest with the top and imbalanced construction produced an F1-score comparable to that of probabilistic record linkage using the expectation maximisation algorithm and to EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F1-score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising the creation of high-quality training data, this approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications.
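The first step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes scikit-learn is available, that records are already encoded as numeric feature vectors, and that the unsupervised random forest follows the common Breiman-style recipe (real rows versus column-permuted synthetic rows, with leaf co-occurrence as proximity). The abstract does not define the three constructions, so `select_training_pairs` is a hypothetical stand-in that labels the highest-scoring pairs as matches and a larger set of lowest-scoring pairs as non-matches. All function names, parameters, and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def urf_proximity(X, n_trees=200, seed=0):
    """Unsupervised random forest proximity (Breiman-style sketch).

    A forest is trained to tell the real rows of X apart from synthetic rows
    built by permuting each column independently. The proximity of two real
    rows is the fraction of trees in which they fall into the same leaf.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    synthetic = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )
    data = np.vstack([X, synthetic])
    labels = np.concatenate([np.ones(len(X)), np.zeros(len(synthetic))])
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)
    leaves = forest.apply(X)  # shape (n_rows, n_trees): leaf index per tree
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)


def select_training_pairs(scores, n_match, n_nonmatch):
    """Hypothetical selection rule (an assumption, not the paper's construction).

    The n_match highest-scoring pairs are labelled 1 (match) and the
    n_nonmatch lowest-scoring pairs are labelled 0 (non-match).
    """
    order = np.argsort(scores, axis=None)  # flat indices, ascending score
    rows, cols = np.unravel_index(order, scores.shape)
    pairs = np.column_stack([rows, cols])
    matches = pairs[::-1][:n_match]
    non_matches = pairs[:n_nonmatch]
    labels = np.concatenate(
        [np.ones(len(matches), dtype=int), np.zeros(len(non_matches), dtype=int)]
    )
    return np.vstack([matches, non_matches]), labels


# Toy data (made up): file_b is a slightly perturbed copy of file_a.
rng = np.random.default_rng(42)
file_a = rng.normal(size=(20, 4))
file_b = file_a + rng.normal(scale=0.01, size=file_a.shape)

prox = urf_proximity(np.vstack([file_a, file_b]), n_trees=100)
cross = prox[:20, 20:]  # similarity of each file_a record to each file_b record

# Imbalanced training set: fewer matches than non-matches.
pairs, labels = select_training_pairs(cross, n_match=5, n_nonmatch=15)
```

The selected pairs and labels would then feed the second step, for example by training a supervised classifier such as a random forest on comparison features for those pairs. How well the automatically assigned labels agree with the truth depends on the data and on the selection rule, so the sketch makes no claim about label accuracy.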
Pages: 17