Saibot: A Differentially Private Data Search Platform

Citations: 0
Authors
Huang, Zezhou [1]
Liu, Jiaxiang [1]
Alabi, Daniel Gbenga [1]
Fernandez, Raul Castro [2]
Wu, Eugene [3]
Affiliations
[1] Columbia Univ, New York, NY 10027 USA
[2] Univ Chicago, Chicago, IL USA
[3] Columbia Univ, DSI, New York, NY USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Vol. 16 / Issue 11
Keywords
NOISE;
DOI
10.14778/3611479.3611508
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although this approach is effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50-90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
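The idea of privatizing sufficient statistics once and reusing them across many model fits can be illustrated with a minimal sketch. This is not the paper's FPM: it covers a single table only (no joins or unions), the function names (`gram_stats`, `privatize`, `fit_from_stats`) are hypothetical, and the noise scale `sigma` is an uncalibrated placeholder, so the code as written carries no DP guarantee. It only shows that, for linear regression, the count, column sums, and sums of pairwise products are enough to fit a model on any feature subset without touching the raw rows again.

```python
import random

def gram_stats(rows):
    """Return (n, s, Q): row count, per-column sums, sums of pairwise products."""
    d = len(rows[0])
    n = len(rows)
    s = [sum(r[i] for r in rows) for i in range(d)]
    Q = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    return n, s, Q

def privatize(stats, sigma, rng):
    """Add Gaussian noise once to every statistic.

    ASSUMPTION: sigma is a placeholder; a real mechanism must calibrate it
    to the sensitivity of the statistics and the privacy budget.
    """
    n, s, Q = stats
    d = len(s)
    noisy_n = n + rng.gauss(0.0, sigma)
    noisy_s = [v + rng.gauss(0.0, sigma) for v in s]
    noisy_Q = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):  # keep the noisy matrix symmetric
            noisy_Q[i][j] = noisy_Q[j][i] = Q[i][j] + rng.gauss(0.0, sigma)
    return noisy_n, noisy_s, noisy_Q

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * m
    for i in reversed(range(m)):
        x[i] = (M[i][m] - sum(M[i][j] * x[j] for j in range(i + 1, m))) / M[i][i]
    return x

def fit_from_stats(stats, features, target):
    """Least squares with intercept, computed from the statistics alone."""
    n, s, Q = stats
    idx = list(features)
    # Normal equations: [[n, s_f^T], [s_f, Q_ff]] w = [s_t, Q_ft]
    A = [[n] + [s[j] for j in idx]]
    for i in idx:
        A.append([s[i]] + [Q[i][j] for j in idx])
    b = [s[target]] + [Q[i][target] for i in idx]
    return solve(A, b)  # [intercept, coefficient per feature]

# Toy data: columns are (x0, x1, y) with y = 2 * x0 + 1.
rows = [(float(x), float(x * x % 7), 2.0 * x + 1.0) for x in range(10)]
exact = gram_stats(rows)
noisy = privatize(exact, sigma=0.5, rng=random.Random(0))  # noise added once

# The same noisy statistics serve every candidate feature set.
w_one = fit_from_stats(noisy, features=[0], target=2)
w_two = fit_from_stats(noisy, features=[0, 1], target=2)
```

Because `noisy` is computed once, evaluating more candidate feature sets adds no further noise draws in this sketch. Sensitivity analysis, joins, and noise redistribution, which the abstract attributes to FPM, are out of scope here.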
Pages: 3057-3070
Number of pages: 14