Training Data Selection for Record Linkage Classification

Authors
Ali Omar, Zaturrawiah [1]
Zamzuri, Zamira Hasanah [2]
Ariff, Noratiqah Mohd [2]
Abu Bakar, Mohd Aftar [2]
Affiliations
[1] Univ Malaysia Sabah, Fac Sci & Nat Resources, Math Comp Graph Programme, Jalan UMS, Kota Kinabalu 88400, Malaysia
[2] Univ Kebangsaan Malaysia, Fac Sci & Technol, Dept Math Sci, Bangi 43600, Malaysia
Source
SYMMETRY-BASEL | 2023 / Vol. 15 / Issue 05
Keywords
record linkage; unsupervised random forest; similarity measure; training data
DOI
10.3390/sym15051060
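Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09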
Abstract
This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs the unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, and each was tested with both balanced (symmetric) and imbalanced (asymmetric) class distributions. The "top and imbalanced" construction was found to be the most effective, producing training data with 100% correct labels. Random forest and support vector machine classification algorithms were compared, and random forest with the top and imbalanced construction produced an F1-score comparable to probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F1-score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising the creation of high-quality training data, this new approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications.
Pages: 17
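
The two-step idea described in the abstract (score candidate record pairs, pick confidently labelled pairs as training data, then evaluate a classifier by recall and F1-score) can be illustrated with a minimal sketch. Several parts are assumptions and not the authors' method: the similarity measure here is an averaged `difflib.SequenceMatcher` ratio swapped in for the paper's unsupervised random forest, the selection rule (highest-scoring pairs as matches, lowest-scoring pairs as non-matches, with more non-matches than matches) is only one plausible reading of an imbalanced construction, and the toy records and the names `pair_score`, `select_training_pairs`, and `precision_recall_f1` are hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]. ASSUMPTION: a stand-in for the
    paper's unsupervised-random-forest similarity measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_score(rec_a, rec_b) -> float:
    """Average the field-wise similarities of two records."""
    sims = [field_similarity(x, y) for x, y in zip(rec_a, rec_b)]
    return sum(sims) / len(sims)


def select_training_pairs(scored, n_match, n_non_match):
    """Label the n_match highest-scoring pairs as matches (1) and the
    n_non_match lowest-scoring pairs as non-matches (0).
    ASSUMPTION: illustrative rule, not the paper's exact construction."""
    if n_match + n_non_match > len(scored):
        raise ValueError("not enough candidate pairs")
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    matches = [(pair, 1) for pair, _ in ranked[:n_match]]
    non_matches = [(pair, 0) for pair, _ in ranked[len(ranked) - n_non_match:]]
    return matches + non_matches


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1-score for binary labels (1 = match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy files: (first name, surname, birth year).
file_a = [("john", "smith", "1980"), ("mary", "jones", "1975"), ("ali", "omar", "1990")]
file_b = [("jon", "smith", "1980"), ("marie", "jones", "1975"), ("peter", "brown", "1962")]

# Step 1a: score every candidate pair (i, j) across the two files.
scored = [((i, j), pair_score(a, b))
          for i, a in enumerate(file_a)
          for j, b in enumerate(file_b)]

# Step 1b: imbalanced training set, more non-matches than matches.
training = select_training_pairs(scored, n_match=2, n_non_match=4)

for pair, label in training:
    print(pair, "match" if label == 1 else "non-match")
```

In a full pipeline, the labelled pairs in `training` would be used to fit a supervised classifier (the abstract compares random forest and support vector machine), and `precision_recall_f1` would be applied to its predictions on pairs with known match status.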