Towards privacy preserving unstructured big data publishing

Cited by: 7
Authors
Mehta, Brijesh [1]
Rao, Udai Pratap [2]
Gupta, Ruchika [3]
Conti, Mauro [4, 5]
Affiliations
[1] Maharana Pratap Univ Agr & Technol, Coll Technol & Engn, Dept Comp Sci & Engn, Udaipur, Rajasthan, India
[2] Sardar Vallabhbhai Natl Inst Technol, Comp Engn Dept, Surat, Gujarat, India
[3] Chandigarh Univ, Comp Sci & Engn Dept, Mohali, Punjab, India
[4] Univ Padua, Dept Math, Padua, Italy
[5] Univ Padua, HIT Ctr, Padua, Italy
Keywords
Privacy preserving big data publishing; unstructured data privacy; named entity recognition; k-anonymity; scalable k-anonymization; ANONYMIZATION;
DOI
10.3233/JIFS-181231
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Various sources and sophisticated tools are used to gather and process comparatively large volumes of data, or big data, which sometimes leads to privacy disclosure (at a broader or finer level) for the data owner. Privacy preserving data publishing approaches such as k-anonymity, l-diversity, and t-closeness are widely used to de-identify data; however, chances of re-identification of attributes always exist because data is collected from multiple sources, such as the public web, social media, Internet whereabouts, and sensors, that are highly prone to data linkage. In the literature, k-anonymity stands out as one of the most popular mainstream data anonymization approaches, and it can also be used for large data. However, applying k-anonymization to a variety of data (especially unstructured data) is difficult in the traditional way, because it requires the given data to be classified into personal data, quasi-identifiers, and sensitive data. We identify existing approaches from the Natural Language Processing (NLP) literature that convert unstructured data to structured form so that k-anonymization can be applied to the generated structured records. We adopt a two-phase Conditional Random Field (CRF) based Named Entity Recognition (NER) approach to represent unstructured data in structured form. Further, we propose Improved Scalable k-Anonymization (ImSKA) to anonymize the well-represented unstructured data, which achieves privacy preserving unstructured big data publishing. We compare both proposed approaches, NER and ImSKA, with existing approaches, and the results show that our proposed solutions outperform the existing approaches in terms of F1 score and Normalized Cardinality Penalty (NCP), respectively. Since NER approaches are widely used for bio-medical datasets, we have also used a well-known Bio-NER dataset, the GENIA corpus, for measuring the performance.
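The k-anonymity requirement mentioned in the abstract can be illustrated with a minimal sketch. This is not the ImSKA algorithm from the paper; it only checks whether a set of structured records (such as those an NER step might produce) is k-anonymous with respect to chosen quasi-identifiers, and applies one simple generalization. The field names (`age`, `zip`, `disease`), the sample records, and the helper `generalize_age` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a range of the given width (hypothetical scheme)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical structured records; "disease" plays the role of the sensitive attribute.
records = [
    {"age": 23, "zip": "395007", "disease": "flu"},
    {"age": 27, "zip": "395007", "disease": "asthma"},
    {"age": 34, "zip": "395007", "disease": "flu"},
    {"age": 38, "zip": "395007", "disease": "diabetes"},
]

quasi_identifiers = ["age", "zip"]

# Exact ages make every record unique on the quasi-identifiers.
raw_ok = is_k_anonymous(records, quasi_identifiers, k=2)

# Generalizing ages into 10-year ranges groups the records into two classes of size 2.
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, quasi_identifiers, k=2)
```

Generalization trades precision for privacy: wider ranges make k-anonymity easier to reach but lose more information, which is the kind of cost that an information-loss metric such as the NCP named in the abstract is meant to quantify.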
Pages: 3471-3482
Number of pages: 12
Related papers
50 in total
  • [1] Privacy Preserving Big Data Publishing
    Canbay, Yavuz
    Vural, Yilmaz
    Sagiroglu, Seref
    [J]. 2018 INTERNATIONAL CONGRESS ON BIG DATA, DEEP LEARNING AND FIGHTING CYBER TERRORISM (IBIGDELFT), 2018: 24-29
  • [2] Privacy-Preserving Big Data Publishing
    Zakerzadeh, Hessam
    Aggarwal, Charu C.
    Barker, Ken
    [J]. PROCEEDINGS OF THE 27TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 2015
  • [3] Privacy Preserving Unstructured Big Data Analytics: Issues and Challenges
    Mehta, Brijesh B.
    Rao, Udai Pratap
    [J]. 1ST INTERNATIONAL CONFERENCE ON INFORMATION SECURITY & PRIVACY 2015, 2016, 78: 120-124
  • [4] Towards Privacy-Preserving Speech Data Publishing
    Qian, Jianwei
    Han, Feng
    Hou, Jiahui
    Zhang, Chunhong
    Wang, Yu
    Li, Xiang-Yang
    [J]. IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2018), 2018: 1088-1096
  • [5] Conceptual Model Suggestions for Privacy Preserving Big Data Publishing
    Canbay, Yavuz
    Vural, Yilmaz
    Sagiroglu, Seref
    [J]. JOURNAL OF POLYTECHNIC-POLITEKNIK DERGISI, 2020, 23(03): 785-798
  • [6] Toward Scalable Anonymization for Privacy-Preserving Big Data Publishing
    Mehta, Brijesh B.
    Rao, Udai Pratap
    [J]. RECENT FINDINGS IN INTELLIGENT COMPUTING TECHNIQUES, VOL 2, 2018, 708: 297-304
  • [7] Privacy preserving data publishing based on sensitivity in context of Big Data using Hive
    Rao P.S.
    Satyanarayana S.
    [J]. Journal of Big Data, 5(1)
  • [8] Privacy-Preserving Equality Test Towards Big Data
    Saha, Tushar Kanti
    Koshiba, Takeshi
    [J]. FOUNDATIONS AND PRACTICE OF SECURITY (FPS 2017), 2018, 10723: 95-110
  • [9] Privacy-Preserving Data Publishing
    Liu, Ruilin
    Wang, Hui
    [J]. 2010 IEEE 26TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDE 2010), 2010: 305-308
  • [10] Privacy preserving in sequential data publishing
    Hossain, Md. Muktar
    Islam, Md. Rabiul
    [J]. 2021 International Conference on Automation, Control and Mechatronics for Industry 4.0, ACMI 2021, 2021