Saibot: A Differentially Private Data Search Platform

Authors
Huang, Zezhou [1]
Liu, Jiaxiang [1]
Alabi, Daniel Gbenga [1]
Fernandez, Raul Castro [2]
Wu, Eugene [3]
Affiliations
[1] Columbia Univ, New York, NY 10027 USA
[2] Univ Chicago, Chicago, IL USA
[3] Columbia Univ, DSI, New York, NY USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Vol. 16 / No. 11
Keywords
NOISE;
DOI
10.14778/3611479.3611508
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve the performance of a model (e.g., linear regression). Although this approach is effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating models over datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests while minimizing the effect of DP noise on search results. We optimize the sensitivity of FPM for common augmentation operations and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization that redistributes DP noise to minimize its impact on the model. Our evaluation on a real-world corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50-90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
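The "privatize once, reuse freely" idea in the abstract can be illustrated with a minimal Python sketch. This is not Saibot's FPM: it only shows the general pattern of computing sufficient statistics for linear regression (a Gram matrix), adding noise to them a single time, and then fitting several candidate models from the noisy statistics without touching the raw rows again. All function names are hypothetical, and `noise_std` is a placeholder that is not calibrated to any sensitivity bound or (epsilon, delta) budget.

```python
import random


def gram_statistics(rows):
    """Sum of z z^T over all rows, where z = [1, features..., target]."""
    d = len(rows[0]) + 1
    gram = [[0.0] * d for _ in range(d)]
    for row in rows:
        z = [1.0] + list(row)
        for i in range(d):
            for j in range(d):
                gram[i][j] += z[i] * z[j]
    return gram


def privatize_once(gram, noise_std, rng):
    """Add symmetric Gaussian noise to the statistics a single time.

    Assumption: noise_std is illustrative only; a real DP mechanism would
    derive it from the sensitivity of the statistics and the privacy budget.
    """
    d = len(gram)
    noisy = [r[:] for r in gram]
    for i in range(d):
        for j in range(i, d):
            n = rng.gauss(0.0, noise_std)
            noisy[i][j] += n
            if i != j:
                noisy[j][i] += n
    return noisy


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_from_statistics(gram, feature_idx):
    """Least-squares fit from stored statistics only.

    feature_idx indexes into z (0 is the intercept); the target is last.
    """
    t = len(gram) - 1
    a = [[gram[i][j] for j in feature_idx] for i in feature_idx]
    b = [gram[i][t] for i in feature_idx]
    return solve(a, b)


if __name__ == "__main__":
    # Synthetic rows (x1, x2, y) with y = 1 + 2*x1 + 3*x2.
    rows = [(float(i), float((i * i) % 7), 1.0 + 2.0 * i + 3.0 * ((i * i) % 7))
            for i in range(20)]
    noisy = privatize_once(gram_statistics(rows), 0.5, random.Random(0))
    # Both candidate models reuse the same noisy statistics.
    print(fit_from_statistics(noisy, [0, 1]))
    print(fit_from_statistics(noisy, [0, 1, 2]))
```

Because every candidate model is fitted from the same noisy matrix, evaluating more feature combinations adds no further noise in this sketch, which is the property the abstract attributes to reusing privatized statistics.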
Pages: 3057-3070
Page count: 14