Enabling PII Discovery in Textual Data via Outlier Detection

Cited by: 0
Authors
Islam, Md Rakibul [1]
Kayem, Anne V. D. M. [2]
Meinel, Christoph [2]
Affiliations
[1] Univ Potsdam, Dept Computat Sci, Potsdam, Germany
[2] Univ Potsdam, Hasso Plattner Inst Digital Engn, Potsdam, Germany
Source
DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2023, PT II | 2023 / Vol. 14147
Keywords
Outlier Detection; Named Entity Recognition; Data Masking; Personal Identifying Information (PII)
DOI
10.1007/978-3-031-39821-6_17
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Discovering Personal Identifying Information (PII) in textual data is an important pre-processing step for enabling privacy-preserving data analytics. One approach to PII discovery in textual data is to characterise the PII as abnormal or unusual observations that can potentially result in privacy violations. However, discovering PII in textual data is challenging because the data is unstructured and comprises sparse representations of similar text elements. This limits both the availability of labelled data for training and the accuracy of PII discovery. In this paper, we present an approach to discovering PII in textual data by characterising the PII as outliers. The PII discovery is done without labelled data, and the PII are identified using named entities. Based on the recognised named entities, we then employ five unsupervised outlier detection models (LOF, DBSCAN, iForest, OCSVM, and SUOD). Our performance comparison results indicate that iForest offers the best prediction accuracy, with an ROC AUC value of 0.89. We employ a masking mechanism to replace discovered PII with semantically similar values. Our results indicate a median semantic similarity score of 0.461 between original and transformed texts, which results in low information loss.
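The abstract compares the detectors by ROC AUC. As a rough illustration of what that metric measures, the sketch below computes ROC AUC from outlier scores in plain Python, using the pairwise (rank-based) definition: the fraction of PII/non-PII pairs in which the PII item receives the higher outlier score. This is not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC of outlier scores against binary labels.

    labels: 1 for a PII item (expected outlier), 0 for a non-PII item.
    scores: outlier scores, where a higher value means more anomalous.
    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from the paper.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.7, 0.8, 0.1, 0.6]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"ROC AUC on toy data: {toy_auc:.3f}")
```

Under this definition, a value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of PII from non-PII items, which gives some context for the reported 0.89.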
Pages: 209 - 216
Page count: 8
Related papers
50 items in total
  • [31] Outlier detection for questionnaire data in biobanks
    Sakurai, Rieko
    Ueki, Masao
    Makino, Satoshi
    Hozawa, Atsushi
    Kuriyama, Shinichi
    Takai-Igarashi, Takako
    Kinoshita, Kengo
    Yamamoto, Masayuki
    Tamiya, Gen
    INTERNATIONAL JOURNAL OF EPIDEMIOLOGY, 2019, 48 (04) : 1305 - 1315
  • [32] Outlier detection in test and questionnaire data
    Zijlstra, Wobbe P.
    van der Ark, L. Andries
    Sijtsma, Klaas
    MULTIVARIATE BEHAVIORAL RESEARCH, 2007, 42 (03) : 531 - 555
  • [33] Online Outlier Detection for Data Streams
    Sadik, Shiblee
    Gruenwald, Le
    PROCEEDINGS OF THE 15TH INTERNATIONAL DATABASE ENGINEERING & APPLICATIONS SYMPOSIUM (IDEAS '11), 2011, : 88 - 96
  • [34] Outlier Detection in High Dimensional Data
    Kamalov, Firuz
    Leung, Ho Hon
    JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT, 2020, 19 (01)
  • [35] Outlier detection in Serbian CommonCrawl Data
    Kalusev, Vladimir
    Culibrk, Dubravko
    2024 23RD INTERNATIONAL SYMPOSIUM INFOTEH-JAHORINA, INFOTEH, 2024,
  • [36] Universal outlier detection for PIV data
    Westerweel, J
    Scarano, F
    EXPERIMENTS IN FLUIDS, 2005, 39 (06) : 1096 - 1100
  • [37] Outlier Detection for Temporal Data: A Survey
    Gupta, Manish
    Gao, Jing
    Aggarwal, Charu C.
    Han, Jiawei
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (09) : 2250 - 2267
  • [38] Outlier Detection on Uncertain Data Streams
    Zhu B.
    Zhong Y.
    Wang X.
    Bai M.
    Hunan Daxue Xuebao/Journal of Hunan University Natural Sciences, 2020, 47 (02) : 134 - 140
  • [39] Outlier detection in large data sets
    Buzzi-Ferraris, Guido
    Manenti, Flavio
    COMPUTERS & CHEMICAL ENGINEERING, 2011, 35 (02) : 388 - 390
  • [40] Outlier Detection Based on the Data Structure
    Guo, Feng
    Shi, Canghong
    Li, Xiaojie
    He, Jia
    Wu, Xi
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,