Saibot: A Differentially Private Data Search Platform

Authors
Huang, Zezhou [1]
Liu, Jiaxiang [1]
Alabi, Daniel Gbenga [1]
Fernandez, Raul Castro [2]
Wu, Eugene [3]
Affiliations
[1] Columbia Univ, New York, NY 10027 USA
[2] Univ Chicago, Chicago, IL USA
[3] Columbia Univ, DSI, New York, NY USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Vol. 16 / No. 11
Keywords
NOISE;
DOI
10.14778/3611479.3611508
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although these platforms are effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting them data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50-90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
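The "privatize once, reuse freely" idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's FPM: the function names, the symmetric Gaussian noise, and the `sigma` parameter are assumptions made for illustration, and `sigma` is an uncalibrated placeholder (a real DP mechanism would bound sensitivity, for example by clipping rows, and derive the noise scale from the privacy budget). The sketch only shows that linear regression can be fit from count, sum, and cross-product statistics, that these statistics add under union, and that any number of fits can reuse the same noisy statistics.

```python
import numpy as np

def sufficient_stats(X, y):
    """Count, column sums, and cross-product matrix of [X | y]."""
    Z = np.column_stack([X, y])
    return len(Z), Z.sum(axis=0), Z.T @ Z

def union_stats(a, b):
    """Statistics of the union of two datasets are element-wise sums."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def privatize(stats, sigma, rng):
    """Add Gaussian noise once. ASSUMPTION: sigma is a placeholder,
    not calibrated to any (epsilon, delta) guarantee."""
    c, s, Q = stats
    noise = rng.normal(0.0, sigma, size=Q.shape)
    noise = (noise + noise.T) / 2  # keep the cross-product matrix symmetric
    return (c + rng.normal(0.0, sigma),
            s + rng.normal(0.0, sigma, size=s.shape),
            Q + noise)

def fit_from_stats(stats, ridge=1e-6):
    """Least-squares coefficients (no intercept) from the statistics alone."""
    _, _, Q = stats
    d = Q.shape[0] - 1             # last row/column belongs to y
    A = Q[:d, :d] + ridge * np.eye(d)   # X^T X (plus a small ridge term)
    b = Q[:d, d]                        # X^T y
    return np.linalg.solve(A, b)
```

Because fitting touches only the (noisy) statistics and never the raw rows, repeated model evaluations during a search are post-processing and would not consume additional privacy budget under a properly calibrated mechanism.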
Pages: 3057-3070
Page count: 14