Reconstructing training data from document understanding models

Authors
Dentan, Jeremie [1, 2]
Paran, Arnaud [1]
Shabou, Aymen [1]
Affiliations
[1] Credit Agr SA, Montrouge, France
[2] IP Paris, Ecole Polytech, Paris, France
Abstract
Document understanding models are increasingly employed by companies to replace humans in processing sensitive documents, such as invoices, tax notices, or even ID cards. However, the robustness of such models to privacy attacks remains largely unexplored. This paper presents CDMI, the first reconstruction attack designed to extract sensitive fields from the training data of these models. We attack LayoutLM and BROS architectures, demonstrating that an adversary can perfectly reconstruct up to 4.1% of the fields of the documents used for fine-tuning, including some names, dates, and invoice amounts of up to six digits. When the reconstruction attack is combined with a membership inference attack, its accuracy rises to 22.5%. In addition, we introduce two new end-to-end metrics and evaluate our approach under various conditions: unimodal or bimodal data, LayoutLM or BROS backbones, four fine-tuning tasks, and two public datasets (FUNSD and SROIE). We also investigate the interplay between overfitting, predictive performance, and susceptibility to our attack. We conclude by discussing possible defenses against our attack and future research directions for building robust document understanding models.
Pages: 6813-6830
Page count: 18