Enabling PII Discovery in Textual Data via Outlier Detection

Authors
Islam, Md Rakibul [1]
Kayem, Anne V. D. M. [2]
Meinel, Christoph [2]
Affiliations
[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany
[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany
Source
DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023 / Vol. 14147
Keywords
Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII)
DOI
10.1007/978-3-031-39821-6_17
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step for enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
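The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of scoring tokens with an isolation forest and evaluating the scores with ROC AUC. It is not the authors' pipeline. The two toy features (a named-entity flag and a rarity value), the sample data, and all function names are assumptions made for illustration, and the forest is simplified: it is pure Python, uses no subsampling, and stops splitting when the chosen feature is constant.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """Normaliser c(n): average path length of an unsuccessful BST search."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature is constant
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point, depth=0):
    """Depth at which the point lands, plus c(size) for unsplit leaves."""
    if "size" in tree:
        return depth + avg_path_length(tree["size"])
    child = tree["left"] if point[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, point, depth + 1)


def anomaly_scores(points, n_trees=100, seed=0):
    """Isolation-forest style score in (0, 1); higher means more outlying."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(max(len(points), 2)))
    trees = [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]
    norm = avg_path_length(len(points)) or 1.0
    return [
        2.0 ** (-sum(path_length(t, p) for t in trees) / (n_trees * norm))
        for p in points
    ]


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical token features: (is_named_entity, rarity). Label 1 marks PII.
features = [
    (0.0, 0.05), (0.0, 0.08), (0.0, 0.11), (0.0, 0.14), (0.0, 0.17),
    (0.0, 0.20), (0.0, 0.23), (0.0, 0.26), (0.0, 0.29), (0.0, 0.32),
    (1.0, 0.88), (1.0, 0.95),
]
labels = [0] * 10 + [1] * 2

scores = anomaly_scores(features)
auc = roc_auc(labels, scores)
print(f"ROC AUC on toy data: {auc:.2f}")
```

The labels here are used only to evaluate the scores, never to fit the forest, which mirrors the unsupervised setting the abstract describes. In practice a library implementation with subsampling would be the more usual choice than this hand-rolled version.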
Pages: 209-216
Page count: 8