An Advanced Semantic Feature-Based Cross-Domain PII Detection, De-Identification, and Re-Identification Model Using Ensemble Learning

Authors
Kulkarni, Poornima [1]
Cauvery, N. K. [1]
Hemavathy, R. [2]
Affiliations
[1] RV Coll Engn, Dept ISE, Bengaluru, India
[2] RV Coll Engn, Dept CSE, Bengaluru, India
Keywords
PII Detection; machine learning; natural language processing; artificial intelligence; de-identification; PERSONALLY IDENTIFIABLE INFORMATION; PRIVACY; PROTECTION; MACHINE
DOI
10.14569/IJACSA.2024.0151277
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Digital data is core to any system and must be exchanged across peers and human-machine interfaces; however, ensuring data security and privacy remains a challenge for industry, especially under the threat of man-in-the-middle attacks, intruders, and ill-intended unauthorized access at data warehouses. Almost all digital communication carries personally identifiable information (PII) such as an individual's address, contact details, and identification credentials. Unauthorized or ill-intended access to these PII attributes can cause major losses to the individual, so it is essential to identify and de-identify such PII elements across digital platforms to preserve privacy. Unfortunately, the diversity of PII attributes across disciplines makes it difficult for state-of-the-art methods to detect PII with a predefined dictionary. A model developed for a specific PII type is not universally viable for other disciplines. Moreover, applying multiple dictionaries for the different disciplines makes a solution more exhaustive. To alleviate these challenges, this paper proposes a robust ensemble-of-ensemble-learning-assisted, semantic-feature-driven, cross-discipline PII detection and de-identification model (EESD-PII). To this end, a large set of text queries encompassing diverse PII attributes, including personal credentials, healthcare data, and financial attributes, was used for training-based PII detection and classification. The input texts underwent preprocessing comprising stop-word removal, punctuation removal, website-link removal, lower-case conversion, lemmatization, and tokenization. The tokenized text was then embedded with Word2Vec continuous bag-of-words (CBOW), which not only provided a latent feature space for analytics but also enabled de-identification to preserve security.
To address class imbalance, synthetic minority over-sampling techniques, namely SMOTE, SMOTE-BL, and SMOTEENN, were applied. Subsequently, features were selected from the resampled data with the Wilcoxon Rank Sum Test (WRST), which retained the most significant features at the 95% confidence level. The selected features were Min-Max normalized to alleviate over-fitting and convergence problems, and the normalized feature vector was classified by an ensemble-of-ensemble learning model with Bagging, Boosting, AdaBoost, Random Forest, and Extra Tree Classifier as base classifiers. The proposed model applies consensus-based majority voting to annotate each text query as PII or Non-PII. A positively annotated query can later be processed with dictionary-based PII attribute masking to achieve de-identification. The semantic embedding, in turn, serves NLP-based PII detection, de-identification, and re-identification tasks. Simulation results show that the proposed EESD-PII model achieves a PII annotation accuracy of 99.77%, precision of 99.81%, recall of 99.63%, and F-measure of 99.71%.
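The steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: every function name, the tiny stop-word list, and the reading of "95% confidence" as a significance threshold of 0.05 are assumptions made for illustration. Lemmatization, the Word2Vec CBOW embedding, the SMOTE variants, and the base classifiers themselves are left out; the rank-sum p-value uses the normal approximation without a tie correction.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the paper does not specify one).
STOP_WORDS = {"a", "an", "the", "is", "at", "my", "to", "of", "and"}


def preprocess(text):
    """Lower-case, strip links and punctuation, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # website-link removal
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation removal
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def rank_sum_p_value(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share their mean rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def select_features(pii_rows, non_pii_rows, alpha=0.05):
    """Return indices of feature columns that differ significantly between classes."""
    keep = []
    for j in range(len(pii_rows[0])):
        a = [row[j] for row in pii_rows]
        b = [row[j] for row in non_pii_rows]
        if rank_sum_p_value(a, b) < alpha:
            keep.append(j)
    return keep


def min_max_normalize(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


def majority_vote(predictions):
    """Pick the label predicted by most base classifiers for one query."""
    return Counter(predictions).most_common(1)[0][0]
```

With five base classifiers and two labels, as in the abstract, a vote can never tie, so `majority_vote` always returns a well-defined PII or Non-PII annotation.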
Pages: 763-779
Page count: 17