Saibot: A Differentially Private Data Search Platform

Authors
Huang, Zezhou [1]
Liu, Jiaxiang [1]
Alabi, Daniel Gbenga [1]
Fernandez, Raul Castro [2]
Wu, Eugene [3]
Affiliations
[1] Columbia Univ, New York, NY 10027 USA
[2] Univ Chicago, Chicago, IL USA
[3] Columbia Univ, DSI, New York, NY USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023, Vol. 16, No. 11
Keywords
NOISE
DOI
10.14778/3611479.3611508
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although this approach is effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can then be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50-90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
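The "privatize once, reuse freely" idea in the abstract can be illustrated with a minimal Python sketch for the union case of linear regression. This is not the paper's FPM: the function names (`sufficient_stats`, `privatize`, `union_stats`, `fit`) are hypothetical, and `noise_scale` is a placeholder that is not calibrated to any privacy budget or sensitivity bound. The sketch only shows that regression can be fitted from summed, already-noised statistics without touching the raw rows again.

```python
import numpy as np

def sufficient_stats(X, y):
    """Return (Z^T Z, Z^T y), where Z is X with an intercept column prepended."""
    Z = np.column_stack([np.ones(len(X)), X])
    return Z.T @ Z, Z.T @ y

def privatize(stats, noise_scale, rng):
    """Add Gaussian noise once to the statistics.

    ASSUMPTION: noise_scale is an illustrative placeholder; a real mechanism
    would derive it from the privacy budget and the sensitivity of the statistics.
    """
    gram, xty = stats
    noise = rng.normal(0.0, noise_scale, size=gram.shape)
    noise = (noise + noise.T) / 2.0  # keep the noisy Gram matrix symmetric
    return gram + noise, xty + rng.normal(0.0, noise_scale, size=xty.shape)

def union_stats(a, b):
    """Statistics of the union of two datasets are the sums of their statistics."""
    return a[0] + b[0], a[1] + b[1]

def fit(stats, ridge=1e-3):
    """Solve the (ridge-regularized) normal equations from the statistics alone."""
    gram, xty = stats
    return np.linalg.solve(gram + ridge * np.eye(gram.shape[0]), xty)

# Usage: each provider privatizes its statistics once; the search then
# combines and refits as often as needed without further access to the data.
rng = np.random.default_rng(0)
X_req, X_prov = rng.normal(size=(200, 2)), rng.normal(size=(300, 2))
y_req = 1.0 + X_req @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=200)
y_prov = 1.0 + X_prov @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=300)

private_req = privatize(sufficient_stats(X_req, y_req), 1.0, rng)
private_prov = privatize(sufficient_stats(X_prov, y_prov), 1.0, rng)
theta = fit(union_stats(private_req, private_prov))
print(theta)  # noisy estimate of [intercept, w1, w2]
```

Because combining and refitting operate only on the noised statistics, repeating the search over many candidate datasets consumes no additional privacy budget in this simplified setting. Joins, which the abstract also covers, are not handled by this sketch.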
Pages: 3057-3070
Page count: 14
Related papers
50 in total
  • [1] Explode: An Extensible Platform for Differentially Private Data Analysis
    Esmerdag, Emir
    Gursoy, Mehmet Emre
    Inan, Ali
    Saygin, Yucel
    2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2016: 1300 - 1303
  • [2] Differentially Private Auctions for Private Data Crowdsourcing
    Shi, Mingyu
    Qiao, Yu
    Wang, Xinbo
    2019 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2019), 2019: 1 - 8
  • [3] Differentially Private Data Generation with Missing Data
    Mohapatra, Shubhankar
    Zong, Jianqiao
    Kerschbaum, Florian
    He, Xi
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (08): 2022 - 2035
  • [4] PCOR: Private Contextual Outlier Release via Differentially Private Search
    Shafieinejad, Masoumeh
    Kerschbaum, Florian
    Ilyas, Ihab F.
    SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2021: 1571 - 1583
  • [5] DPGraph: A Benchmark Platform for Differentially Private Graph Analysis
    Xia, Siyuan
    Chang, Beizhen
    Knopf, Karl
    He, Yihan
    Tao, Yuchao
    He, Xi
    SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2021: 2808 - 2812
  • [6] PrivSyn: Differentially Private Data Synthesis
    Zhang, Zhikun
    Wang, Tianhao
    Li, Ninghui
    Honorio, Jean
    Backes, Michael
    He, Shibo
    Chen, Jiming
    Zhang, Yang
    PROCEEDINGS OF THE 30TH USENIX SECURITY SYMPOSIUM, 2021: 929 - 946
  • [7] Differentially Private Topological Data Analysis
    Kang, Taegyu
    Kim, Sehwan
    Sohn, Jinwon
    Awan, Jordan
    JOURNAL OF MACHINE LEARNING RESEARCH, 2024, 25
  • [8] Differentially Private Multidimensional Data Publication
    Zhang Ji
    Dong Xin
    Yu Jiadi
    Luo Yuan
    Li Minglu
    Wu Bin
    CHINA COMMUNICATIONS, 2014, 11 (01) : 79 - 85
  • [9] Differentially Private Distributed Data Analysis
    Takabi, Hassan
    Koppikar, Samir
    Zargar, Saman Taghavi
    2016 IEEE 2ND INTERNATIONAL CONFERENCE ON COLLABORATION AND INTERNET COMPUTING (IEEE CIC), 2016: 212 - 218
  • [10] Differentially private multidimensional data publishing
    Al-Hussaeni, Khalil
    Fung, Benjamin C. M.
    Iqbal, Farkhund
    Liu, Junqiang
    Hung, Patrick C. K.
    KNOWLEDGE AND INFORMATION SYSTEMS, 2018, 56 (03) : 717 - 752