An Advanced Semantic Feature-Based Cross-Domain PII Detection, De-Identification, and Re-Identification Model Using Ensemble Learning

Cited by: 0

Authors:

Kulkarni, Poornima [1]

Cauvery, N. K. [1]

Hemavathy, R. [2]

Affiliations:

[1] RV Coll Engn, Dept ISE, Bengaluru, India

[2] RV Coll Engn, Dept CSE, Bengaluru, India

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2024, Vol. 15, No. 12

Keywords:

PII Detection; machine learning; natural language processing; artificial intelligence; de-identification; PERSONALLY IDENTIFIABLE INFORMATION; PRIVACY; PROTECTION; MACHINE;

DOI:

10.14569/IJACSA.2024.0151277

Chinese Library Classification (CLC) number:

TP301 [Theory and methods]

Subject classification code:

081202

Abstract:

Digital data is core to any system and must be communicated across peers and human-machine interfaces; however, ensuring data security and privacy remains a challenge for industry, especially under the threat of man-in-the-middle attacks, intruders, and ill-intended unauthorized access at warehouses. Almost all digital communication contains personally identifiable information (PII) such as an individual's address, contact details, and identification credentials. Unauthorized or ill-intended access to these PII attributes can cause major losses to the individual, so it is essential to identify and de-identify PII elements across digital platforms to preserve privacy. Unfortunately, the diversity of PII attributes across disciplines makes it difficult for state-of-the-art methods to detect PII using a predefined dictionary. A model developed for a specific PII type is not universally viable for other disciplines, and applying multiple dictionaries for different disciplines makes a solution cumbersome. To alleviate these challenges, this paper proposes a robust ensemble-of-ensemble-learning-assisted, semantic-feature-driven, cross-discipline PII detection and de-identification model (EESD-PII). A large set of text queries encompassing diverse PII attributes, including personal credentials, healthcare data, and finance attributes, was used to train PII detection and classification. The input texts underwent preprocessing comprising stop-word removal, punctuation removal, website-link removal, lowercase conversion, lemmatization, and tokenization. The tokenized text was then embedded with Word2Vec continuous bag-of-words (CBOW), which not only provided a latent feature space for analytics but also enabled de-identification to preserve security.
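The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It uses only the standard library and is not the authors' implementation: the stop-word list, the URL pattern, and the function name `preprocess` are assumptions made for illustration, and lemmatization is omitted because it would require an external NLP library.

```python
import re
import string

# Illustrative stop-word list (an assumption; the abstract does not specify one).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "at", "in",
              "on", "of", "to", "and", "my", "for"}

# Hypothetical pattern for website links: http(s) URLs or strings starting with "www.".
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")


def preprocess(text):
    """Return a list of cleaned tokens for one text query."""
    text = text.lower()                      # lowercase conversion
    text = URL_PATTERN.sub(" ", text)        # website-link removal (before punctuation is stripped)
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


print(preprocess("My email is at https://example.com, call 555-1234!"))
```

Links are removed before punctuation so that the URL pattern can still match; stripping punctuation first would break the `://` marker the pattern relies on.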
To address class imbalance, synthetic minority over-sampling techniques such as SMOTE, SMOTE-BL, and SMOTEENN were applied. The resampled features then underwent feature selection with the Wilcoxon Rank Sum Test (WRST), which retained the most significant features at a 95% confidence level. The selected features were Min-Max normalized to alleviate over-fitting and convergence problems, and the normalized feature vector was classified by an ensemble-of-ensemble learning model with Bagging, Boosting, AdaBoost, Random Forest, and Extra Tree Classifier as base classifiers. The proposed model applies consensus-based majority voting to annotate each text query as PII or Non-PII. A positively annotated query can later be processed with dictionary-based PII attribute masking to achieve de-identification. The semantic embedding thereby supports NLP-based PII detection, de-identification, and re-identification tasks. Simulation results show that the proposed EESD-PII model achieves a PII annotation accuracy of 99.77%, precision of 99.81%, recall of 99.63%, and F-measure of 99.71%.
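Two of the steps described above, Min-Max normalization and majority voting over base-classifier outputs, can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's code: the function names `min_max_normalize` and `majority_vote` are hypothetical, the handling of a constant feature column is an assumption, and the five example votes merely stand in for the five base classifiers listed in the abstract.

```python
from collections import Counter


def min_max_normalize(column):
    """Rescale one feature column to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def majority_vote(predictions):
    """Return the label predicted by the most base classifiers.

    With an odd number of voters and two labels, no tie can occur.
    """
    return Counter(predictions).most_common(1)[0][0]


# One feature column before and after scaling.
print(min_max_normalize([2.0, 4.0, 6.0]))

# Hypothetical votes from five base classifiers for a single text query.
votes = ["PII", "Non-PII", "PII", "PII", "Non-PII"]
print(majority_vote(votes))
```

Using an odd number of voters for a binary PII / Non-PII decision avoids the need for a tie-breaking rule, which the abstract does not describe.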


Pages: 763-779

Number of pages: 17
