Enabling PII Discovery in Textual Data via Outlier Detection

Cited by: 0

Authors:

Islam, Md Rakibul [1]

Kayem, Anne V. D. M. [2]

Meinel, Christoph [2]

Affiliations:

[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany

[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany

Source:

DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023 / Vol. 14147

Keywords:

Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII);

DOI:

10.1007/978-3-031-39821-6_17

Chinese Library Classification:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step for enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between the original and transformed texts, which results in low information loss.
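
The abstract describes scoring observations with unsupervised outlier detectors and comparing the detectors by ROC AUC. The following is a minimal sketch of that evaluation loop, not the authors' implementation: it uses a simplified, pure-Python Local Outlier Factor (LOF, one of the five models named above) and a pairwise ROC AUC. The synthetic 2-D points stand in for features derived from named entities, which the abstract does not specify; the function names `k_nearest`, `lof_scores`, and `roc_auc`, the neighbourhood size `k=5`, and the toy data are all assumptions made for illustration.

```python
import math
import random


def k_nearest(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]


def lof_scores(points, k=5):
    """Simplified LOF: scores well above 1 suggest an outlier.

    Simplification (assumption): exactly k neighbours are used, so ties at
    the k-th distance are not all included as in the full LOF definition.
    """
    n = len(points)
    neigh = [k_nearest(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]  # distance to the k-th neighbour
    lrd = []  # local reachability density of each point
    for i in range(n):
        # Reachability distance of point i with respect to each neighbour j.
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))  # guard duplicates
    # LOF = mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]


def roc_auc(labels, scores):
    """Probability that a random outlier outscores a random inlier (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): a dense cluster of "ordinary" observations plus
# four isolated points playing the role of PII-like outliers.
random.seed(0)
inliers = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(60)]
outliers = [(10.0, 0.0), (-10.0, 0.0), (0.0, 10.0), (0.0, -10.0)]
points = inliers + outliers
labels = [0] * len(inliers) + [1] * len(outliers)

scores = lof_scores(points, k=5)
auc = roc_auc(labels, scores)
print(f"ROC AUC on the toy data: {auc:.2f}")
```

Ground-truth labels appear only in the evaluation step; the detector itself never sees them, which matches the unsupervised setting the abstract describes. Swapping in another detector only requires replacing `lof_scores` with a function that returns one score per observation, higher meaning more anomalous.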


Pages: 209-216

Page count: 8
