Towards privacy preserving unstructured big data publishing

Cited by: 7
Authors
Mehta, Brijesh [1]
Rao, Udai Pratap [2]
Gupta, Ruchika [3]
Conti, Mauro [4, 5]
Affiliations
[1] Maharana Pratap Univ Agr & Technol, Coll Technol & Engn, Dept Comp Sci & Engn, Udaipur, Rajasthan, India
[2] Sardar Vallabhbhai Natl Inst Technol, Comp Engn Dept, Surat, Gujarat, India
[3] Chandigarh Univ, Comp Sci & Engn Dept, Mohali, Punjab, India
[4] Univ Padua, Dept Math, Padua, Italy
[5] Univ Padua, HIT Ctr, Padua, Italy
Keywords
Privacy preserving big data publishing; unstructured data privacy; named entity recognition; k-anonymity; scalable k-anonymization; ANONYMIZATION;
DOI
10.3233/JIFS-181231
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data are gathered from many sources and processed with sophisticated tools, and the resulting large volumes (big data) can sometimes disclose private information about the data owner, at a coarse or fine level. Privacy preserving data publishing approaches such as k-anonymity, l-diversity, and t-closeness are widely used to de-identify data; however, a risk of re-identification always remains, because data is collected from multiple sources such as the public web, social media, Internet whereabouts, and sensors, which are highly prone to data linkage. In the literature, k-anonymity stands out as one of the most popular mainstream data anonymization approaches, and it can also be applied to large data sets. However, applying k-anonymization to a variety of data (especially unstructured data) is difficult in the traditional way, because it requires the data to be classified into personal data, quasi-identifiers, and sensitive data. We identify existing approaches from the Natural Language Processing (NLP) literature for converting unstructured data into structured form, so that k-anonymization can be applied to the generated structured records. We adopt a two-phase Conditional Random Field (CRF) based Named Entity Recognition (NER) approach to represent unstructured data in structured form. Further, we propose Improved Scalable k-Anonymization (ImSKA) to anonymize the structured representation of the unstructured data, thereby achieving privacy preserving unstructured big data publishing. We compare both proposed approaches, NER and ImSKA, with existing approaches, and the results show that our proposed solutions outperform them in terms of F1 score and Normalized Cardinality Penalty (NCP), respectively. Since NER approaches are widely used for biomedical datasets, we have also used a well-known Bio-NER dataset, the GENIA corpus, for measuring performance.
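The k-anonymity requirement mentioned in the abstract can be illustrated with a small sketch. This is not the paper's ImSKA algorithm or its NER pipeline; it only shows, on made-up records, what it means for structured records to be k-anonymous over their quasi-identifiers and how a range-based information-loss penalty might be computed. The record fields, the helper names (`generalize_age`, `generalize_zip`, `is_k_anonymous`, `ncp_numeric`), and the range-based NCP formula are all assumptions for illustration, since the abstract does not define them.

```python
from collections import defaultdict

def generalize_age(age, width=10):
    """Map an exact age to a (low, high) interval of the given width."""
    low = (age // width) * width
    return (low, low + width - 1)

def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a postal code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    counts = defaultdict(int)
    for rec in records:
        counts[tuple(rec[q] for q in quasi_identifiers)] += 1
    return all(c >= k for c in counts.values())

def ncp_numeric(interval, domain_min, domain_max):
    """Assumed range-based penalty: interval width over the attribute's domain width."""
    low, high = interval
    return (high - low) / (domain_max - domain_min)

# Hypothetical structured records, e.g. as an NER step might extract from text.
raw = [
    {"name": "P1", "age": 23, "zip": "395007", "diagnosis": "flu"},
    {"name": "P2", "age": 27, "zip": "395001", "diagnosis": "asthma"},
    {"name": "P3", "age": 21, "zip": "395004", "diagnosis": "flu"},
    {"name": "P4", "age": 34, "zip": "313001", "diagnosis": "diabetes"},
    {"name": "P5", "age": 38, "zip": "313002", "diagnosis": "flu"},
    {"name": "P6", "age": 36, "zip": "313004", "diagnosis": "asthma"},
]

# Drop the direct identifier, generalize the quasi-identifiers,
# and keep the sensitive attribute unchanged.
published = [
    {
        "age": generalize_age(r["age"]),
        "zip": generalize_zip(r["zip"]),
        "diagnosis": r["diagnosis"],
    }
    for r in raw
]

QI = ["age", "zip"]
raw_ok = is_k_anonymous(raw, QI, 2)              # exact values are all unique
published_ok = is_k_anonymous(published, QI, 3)  # two groups of three records

# Average penalty for the age attribute, assuming an age domain of 20..39.
avg_ncp = sum(ncp_numeric(r["age"], 20, 39) for r in published) / len(published)

print(raw_ok, published_ok, round(avg_ncp, 3))
```

In this toy data the generalized table forms two groups of three records, so it satisfies 3-anonymity but not 4-anonymity. Wider intervals would allow a larger k at the cost of a higher penalty, which is the trade-off a measure such as NCP is meant to quantify.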
Pages: 3471-3482
Number of pages: 12
Related papers
50 in total
  • [31] Privacy preserving big data publishing: a scalable k-anonymization approach using MapReduce
    Mehta, Brijesh B.
    Rao, Udai Pratap
    [J]. IET SOFTWARE, 2017, 11 (05) : 271 - 276
  • [32] Differential Privacy in Power Big Data Publishing
    Kong, Ping
    Wang, Xiaochun
    Zhang, Boyi
    Li, Yidong
    [J]. PARALLEL ARCHITECTURE, ALGORITHM AND PROGRAMMING, PAAP 2017, 2017, 729 : 471 - 479
  • [33] Privacy-preserving clustering of unstructured big data for cloud-based enterprise search solutions
    Zobaed, Sm
    Salehi, Mohsen Amini
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (22):
  • [34] DATA MINING AS A TOOL IN PRIVACY-PRESERVING DATA PUBLISHING
    Sramka, Michal
    [J]. NILCRYPT 10, 2010, 45 : 151 - 159
  • [35] Privacy Preserving Data Publishing and Data Anonymization Approaches: A Review
    Goswami, Puneet
    Madan, Suman
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND AUTOMATION (ICCCA), 2017 : 139 - 142
  • [36] Cooperative privacy game: a novel strategy for preserving privacy in data publishing
    Kumari, Valli
    Chakravarthy, Srinivasa
    [J]. HUMAN-CENTRIC COMPUTING AND INFORMATION SCIENCES, 2016, 6
  • [37] Personalized Privacy-Preserving Trajectory Data Publishing
    Lu Qiwei
    Wang Caimei
    Xiong Yan
    Xia Huihua
    Huang Wenchao
    Gong Xudong
    [J]. CHINESE JOURNAL OF ELECTRONICS, 2017, 26 (02) : 285 - 291
  • [38] Privacy-Preserving Continuous Event Data Publishing
    Rafiei, Majid
    van der Aalst, Wil M. P.
    [J]. BUSINESS PROCESS MANAGEMENT FORUM (BPM 2021), 2021, 427 : 178 - 194
  • [39] Privacy Preserving Serial Data Publishing By Role Composition
    Bu, Yingyi
    Fu, Ada Wai Chee
    Wong, Raymond Chi Wing
    Chen, Lei
    Li, Jiuyong
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (01) : 845 - 856
  • [40] Privacy-preserving data publishing for cluster analysis
    Fung, Benjamin C. M.
    Wang, Ke
    Wang, Lingyu
    Hung, Patrick C. K.
    [J]. DATA & KNOWLEDGE ENGINEERING, 2009, 68 (06) : 552 - 575