Enabling PII Discovery in Textual Data via Outlier Detection

Cited by: 0

Authors:

Islam, Md Rakibul [1]

Kayem, Anne V. D. M. [2]

Meinel, Christoph [2]

Affiliations:

[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany

[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany

Source:

DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023, vol. 14147

Keywords:

Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII)

DOI:

10.1007/978-3-031-39821-6_17

Chinese Library Classification:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step for enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
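
The evaluation idea in the abstract (score each item for "outlierness" without labels, then compare the scores against known PII labels using ROC AUC) can be illustrated with a small sketch. This is not the authors' pipeline: the feature vectors and labels below are hypothetical toy values, and a simple mean distance to the k nearest neighbours stands in for the five detectors named in the abstract. ROC AUC is computed directly from its definition as the probability that a randomly chosen PII item receives a higher score than a randomly chosen non-PII item.

```python
from math import dist


def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours.

    A stand-in detector for illustration only (not LOF, DBSCAN, iForest,
    OCSVM, or SUOD). Larger scores mean "more unusual".
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical 2-D feature vectors for six text items (assumed, not from the paper).
# The first four form a dense cluster; the last two stand in for PII.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0), (0.2, 2.5)]
is_pii = [0, 0, 0, 0, 1, 1]

scores = knn_outlier_scores(features, k=2)  # unsupervised: labels are not used here
auc = roc_auc(is_pii, scores)               # labels are used only for evaluation
print(f"ROC AUC on toy data: {auc:.2f}")
```

On this toy data both isolated points score higher than every clustered point, so the ranking is perfect; real text features would be far less cleanly separated, which is why a value such as the reported 0.89 is informative.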

Pages: 209-216

Page count: 8
