Towards privacy preserving unstructured big data publishing

Cited by: 7
Authors
Mehta, Brijesh [1]
Rao, Udai Pratap [2]
Gupta, Ruchika [3]
Conti, Mauro [4, 5]
Affiliations
[1] Maharana Pratap Univ Agr & Technol, Coll Technol & Engn, Dept Comp Sci & Engn, Udaipur, Rajasthan, India
[2] Sardar Vallabhbhai Natl Inst Technol, Comp Engn Dept, Surat, Gujarat, India
[3] Chandigarh Univ, Comp Sci & Engn Dept, Mohali, Punjab, India
[4] Univ Padua, Dept Math, Padua, Italy
[5] Univ Padua, HIT Ctr, Padua, Italy
Keywords
Privacy preserving big data publishing; unstructured data privacy; named entity recognition; k-anonymity; scalable k-anonymization; ANONYMIZATION
DOI
10.3233/JIFS-181231
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Various sources and sophisticated tools are used to gather and process comparatively large volumes of data, or big data, which sometimes leads to privacy disclosure (at a broader or finer level) for the data owner. Privacy preserving data publishing approaches such as k-anonymity, l-diversity, and t-closeness are widely used to de-identify data; however, the chance of re-identification of attributes always exists, because data is collected from multiple sources such as the public web, social media, Internet whereabouts, and sensors, which are highly prone to data linkage. In the literature, k-anonymity stands out among the most popular mainstream data anonymization approaches and can also be used for large data. However, applying k-anonymization to a variety of data (especially unstructured data) is difficult in the traditional way, because it requires the given data to be classified into personal data, quasi-identifiers, and sensitive data. We identify existing approaches from the Natural Language Processing (NLP) literature to convert unstructured data to structured form so that k-anonymization can be applied to the generated structured records. We adopt a two-phase Conditional Random Field (CRF) based Named Entity Recognition (NER) approach to represent unstructured data in structured form. Further, we propose Improved Scalable k-Anonymization (ImSKA) to anonymize the well-represented unstructured data, which achieves privacy preserving unstructured big data publishing. We compare both proposed approaches, namely NER and ImSKA, with existing approaches, and the results show that our proposed solutions outperform the existing approaches in terms of F1 score and Normalized Cardinality Penalty (NCP), respectively. Since NER approaches are widely used for bio-medical datasets, we have also used a well-known Bio-NER dataset, the GENIA corpus, for measuring performance.
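As a rough illustration of the k-anonymity property the abstract relies on, the Python sketch below checks whether every combination of quasi-identifier values occurs in at least k records, before and after a simple generalization step. It is not the paper's ImSKA algorithm or its NER pipeline; the records, the field names, and the helpers `is_k_anonymous` and `generalize_age` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(age, width=10):
    """Replace an exact age with a coarser range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical structured records, standing in for fields that an NER step
# might extract from unstructured text (assumed data, not from the paper).
raw = [
    {"age": 31, "city": "Surat", "diagnosis": "flu"},
    {"age": 35, "city": "Surat", "diagnosis": "asthma"},
    {"age": 42, "city": "Padua", "diagnosis": "flu"},
    {"age": 47, "city": "Padua", "diagnosis": "diabetes"},
]

# Generalize the quasi-identifier "age"; the sensitive field is left as-is.
generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]

qi = ["age", "city"]
print(is_k_anonymous(raw, qi, k=2))
print(is_k_anonymous(generalized, qi, k=2))
```

Here the exact ages make every record unique on the quasi-identifiers, while the ten-year ranges place two records in each group, so only the generalized table satisfies 2-anonymity. Coarser ranges lose more information, which is the kind of cost a penalty measure such as NCP is meant to quantify.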
Pages: 3471-3482
Number of pages: 12