Enabling PII Discovery in Textual Data via Outlier Detection

Authors
Islam, Md Rakibul [1]
Kayem, Anne V. D. M. [2]
Meinel, Christoph [2]
Affiliations
[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany
[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany
Source
DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023 / Vol. 14147
Keywords
Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII)
DOI
10.1007/978-3-031-39821-6_17
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step for enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
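
The idea of treating PII as outliers and scoring detectors by ROC AUC can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not specify the features derived from the named entities, so the per-token feature pairs, the tokens, and the labels below are hypothetical toy values. The sketch implements LOF (one of the five models named in the abstract) in plain Python and evaluates the scores with a rank-based ROC AUC.

```python
import math


def lof_scores(points, k=3):
    """Local Outlier Factor for each point (plain Python, O(n^2)).

    Assumes no more than k exact duplicates of any point, otherwise the
    local reachability density would divide by zero.
    """
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point, excluding itself.
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from a point to its k-th nearest neighbour.
    k_dist = [dist[i][neighbours[i][-1]] for i in range(n)]

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        1.0 / (sum(reach_dist(i, j) for j in neighbours[i]) / k)
        for i in range(n)
    ]
    # LOF: mean neighbour density divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


def roc_auc(scores, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one (NER confidence, rarity) pair per token.
# These features are an assumption made for illustration only.
tokens = ["the", "patient", "was", "seen", "in", "today", "Meyer"]
features = [
    (0.0, 0.0), (0.1, 0.1), (0.0, 0.1), (0.1, 0.0),
    (0.05, 0.05), (0.1, 0.2), (0.9, 0.9),
]
labels = [0, 0, 0, 0, 0, 0, 1]  # 1 marks the token treated as PII

scores = lof_scores(features, k=3)
auc = roc_auc(scores, labels)

# Print tokens from most to least outlying, then the evaluation metric.
for token, score in sorted(zip(tokens, scores), key=lambda pair: -pair[1]):
    print(f"{token:8s} LOF={score:.2f}")
print(f"ROC AUC = {auc:.2f}")
```

A LOF score near 1 means a token sits in a region as dense as its neighbours, while a score well above 1 marks it as a candidate outlier. In practice a library implementation would likely replace the hand-written functions; the plain-Python version is used here only to keep the sketch self-contained.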
Pages: 209-216
Number of pages: 8