Privacy-Preserving Record Linkage for Cardinality Counting

Cited by: 2
Authors
Wu, Nan [1]
Vatsalan, Dinusha [2]
Kaafar, Mohamed Ali [2]
Ramesh, Sanath Kumar [3]
Affiliations
[1] Macquarie Univ, CSIROs Data61, Sydney, Australia
[2] Macquarie Univ, Sydney, Australia
[3] CuresDev LLC, OpenTreatments Fdn, San Jose, CA USA
Keywords
Probabilistic counting; distinct-counting; fuzzy matching; Bloom filters; unsupervised learning; differential privacy
DOI
10.1145/3579856.3590338
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Several applications require counting the number of distinct items in a dataset, which is known as the cardinality counting problem. Examples include health applications, such as counting rare disease patients to support adequate awareness and funding and counting the cases of a new disease for outbreak detection; marketing applications, such as counting the reach of a new product; and cybersecurity applications, such as tracking the number of unique views of social media posts. However, the data needed for counting is often personal and sensitive, and needs to be processed using privacy-preserving techniques. Data quality issues across different databases, for example typos, errors, and variations, pose additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention in recent times and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far been done on privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm that uses unsupervised clustering techniques to link and count the cardinality of individuals in multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods, which take the optimal number of clusters as the cardinality, are far from accurate because they do not take into account the purity and completeness of the generated clusters. We therefore propose a novel method to find the optimal number of clusters in unsupervised learning. Experimental results on real and synthetic datasets are highly promising: our method achieves a significantly smaller error rate, below 0.1 at a privacy budget of epsilon = 1.0, than the state-of-the-art fuzzy matching and clustering method.
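The general idea behind the keywords "Bloom filters" and "fuzzy matching" can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes character bigrams hashed into a Bloom filter, Dice similarity between filters, and a simple greedy threshold clustering in place of the paper's clustering and cluster-number selection, and it adds no differential privacy noise. All function names, the filter size, the hash count, and the threshold are hypothetical choices made for illustration.

```python
import hashlib


def bigrams(value):
    """Split a string into its set of character bigrams (2-grams)."""
    s = value.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bloom_encode(value, num_bits=256, num_hashes=4):
    """Encode a string as the set of bit positions set in a Bloom filter.

    Each bigram is hashed num_hashes times (seeded SHA-256) into num_bits
    positions. These parameters are illustrative assumptions.
    """
    bits = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % num_bits)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def estimate_cardinality(values, threshold=0.7):
    """Greedy clustering: a record joins the first cluster whose
    representative is similar enough, otherwise it starts a new cluster.
    The number of clusters is the cardinality estimate."""
    representatives = []
    for value in values:
        encoded = bloom_encode(value)
        if not any(dice(encoded, rep) >= threshold for rep in representatives):
            representatives.append(encoded)
    return len(representatives)


# Hypothetical records containing typos and variations of three individuals.
records = ["john smith", "jon smith", "mary jones", "mary jone", "peter brown"]
print(estimate_cardinality(records))
```

Because similar strings share most of their bigrams, their Bloom filters share most of their set bits, so records with small typos can be grouped without comparing the plaintext values. A provable privacy guarantee, as described in the abstract, would additionally require perturbing the encodings under a privacy budget, which this sketch omits.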
Pages: 53-64
Page count: 12