Enabling PII Discovery in Textual Data via Outlier Detection

Authors
Islam, Md Rakibul [1]
Kayem, Anne V. D. M. [2]
Meinel, Christoph [2]
Affiliations
[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany
[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany
Source
DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023, Vol. 14147
Keywords
Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII)
DOI
10.1007/978-3-031-39821-6_17
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Subject classification codes
081202; 0835
Abstract
Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step in enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
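The abstract's core idea, scoring items with an unsupervised isolation forest and evaluating the scores with ROC AUC, can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D feature vectors are synthetic stand-ins for whatever features the paper derives from named entities, and every name and parameter below (fit_forest, anomaly_score, roc_auc, the tree count, the sample size) is an assumption chosen for illustration. Only the Python standard library is used.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus a correction for unsplit leaf points."""
    if tree[0] == "leaf":
        return depth + avg_path_length(tree[1])
    _, dim, split, left, right = tree
    child = left if x[dim] < split else right
    return path_length(child, x, depth + 1)


def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size


def anomaly_score(forest, x):
    """Score in (0, 1]; points isolated after few splits score higher."""
    trees, sample_size = forest
    mean_depth = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / avg_path_length(sample_size))


def roc_auc(scores, labels):
    """Probability that a random outlier outscores a random inlier."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in data (assumption): a dense cluster of "ordinary" items
# and a few distant points playing the role of PII-like outliers.
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outliers = [(data_rng.uniform(6, 8), data_rng.uniform(6, 8)) for _ in range(10)]
points = inliers + outliers
labels = [0] * len(inliers) + [1] * len(outliers)

forest = fit_forest(points)
scores = [anomaly_score(forest, p) for p in points]
auc = roc_auc(scores, labels)
print(f"ROC AUC on synthetic data: {auc:.3f}")
```

The labels are used only to evaluate the scores, never to fit the forest, which mirrors the unsupervised setting the abstract describes. The value printed here reflects the synthetic data and is not comparable to the 0.89 reported in the paper.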
Pages: 209-216
Page count: 8