Enabling PII Discovery in Textual Data via Outlier Detection

Cited by: 0

Authors:

Islam, Md Rakibul [1]

Kayem, Anne V. D. M. [2]

Meinel, Christoph [2]

Affiliations:

[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany

[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany

Source:

DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023 / Vol. 14147

Keywords:

Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII)

DOI:

10.1007/978-3-031-39821-6_17

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Subject classification codes:

081202; 0835

Abstract:

Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step towards enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five (5) unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
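The evaluation idea in the abstract (score items without labels, then measure how well the scores rank known outliers using ROC AUC) can be illustrated with a minimal sketch. This is not the paper's pipeline: the k-nearest-neighbour distance score is a simple stand-in for the five models named above, the 2-D points are synthetic placeholders for features derived from named entities, and the function names `knn_outlier_scores` and `roc_auc` are hypothetical.

```python
import math
import random


def knn_outlier_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbours.

    Higher scores mean the point sits in a sparser region. This is a simple
    stand-in for an unsupervised detector, not one of the paper's models.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (outlier, inlier) pairs ranked correctly.

    Ties between an outlier score and an inlier score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)

# Synthetic placeholder features (assumption): a dense cluster of ordinary
# items plus a few items scattered far away from it.
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outliers = []
for _ in range(10):
    radius = rng.uniform(6, 12)
    angle = rng.uniform(0, 2 * math.pi)
    outliers.append((radius * math.cos(angle), radius * math.sin(angle)))

points = inliers + outliers
labels = [0] * len(inliers) + [1] * len(outliers)  # used for evaluation only

scores = knn_outlier_scores(points, k=3)
auc = roc_auc(labels, scores)
print(f"ROC AUC of the k-NN distance score: {auc:.3f}")
```

The labels are never shown to the scoring function; they are used only afterwards to compute the ROC AUC, which mirrors how an unsupervised detector can be assessed against a labelled evaluation set.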


Pages: 209-216

Page count: 8
