Reconstructing training data from document understanding models

Cited by: 0
Authors
Dentan, Jeremie [1, 2]
Paran, Arnaud [1]
Shabou, Aymen [1]
Affiliations
[1] Credit Agr SA, Montrouge, France
[2] IP Paris, Ecole Polytech, Paris, France
DOI: not available
Abstract
Document understanding models are increasingly employed by companies to supplant humans in processing sensitive documents, such as invoices, tax notices, or even ID cards. However, the robustness of such models to privacy attacks remains vastly unexplored. This paper presents CDMI, the first reconstruction attack designed to extract sensitive fields from the training data of these models. We attack LayoutLM and BROS architectures, demonstrating that an adversary can perfectly reconstruct up to 4.1% of the fields of the documents used for fine-tuning, including some names, dates, and invoice amounts up to six-digit numbers. When our reconstruction attack is combined with a membership inference attack, our attack accuracy escalates to 22.5%. In addition, we introduce two new end-to-end metrics and evaluate our approach under various conditions: unimodal or bimodal data, LayoutLM or BROS backbones, four fine-tuning tasks, and two public datasets (FUNSD and SROIE). We also investigate the interplay between overfitting, predictive performance, and susceptibility to our attack. We conclude with a discussion on possible defenses against our attack and potential future research directions to construct robust document understanding models.
Pages: 6813-6830
Page count: 18