Enabling PII Discovery in Textual Data via Outlier Detection

Citations: 0

Authors:

Islam, Md Rakibul [1]

Kayem, Anne V. D. M. [2]

Meinel, Christoph [2]

Affiliations:

[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany

[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany

Source:

DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023 / Vol. 14147

Keywords:

Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII)

DOI:

10.1007/978-3-031-39821-6_17

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Subject classification code:

081202; 0835

Abstract:

Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step in enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
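To make the ROC AUC figure in the abstract concrete: for an outlier detector, ROC AUC can be read as the probability that a randomly chosen true PII item receives a higher outlier score than a randomly chosen non-PII item. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `roc_auc` and the labels and scores are hypothetical illustration values, not data from the paper.

```python
from itertools import product


def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) ROC AUC.

    labels: 1 for a true PII item, 0 otherwise.
    scores: outlier scores, where higher means "more anomalous".
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: six items, two of which are true PII.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 7 of the 8 positive/negative pairs are ranked correctly: 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.89 for iForest should be read.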

Pages: 209-216

Page count: 8
