Experimenting sensitivity-based anonymization framework in apache spark

Cited by: 7
Authors
Al-Zobbi, Mohammed [1]
Shahrestani, Seyed [1]
Ruan, Chun [1]
Affiliations
[1] Western Sydney Univ, Sch Comp Engn & Math, Locked Bag 1797,Kingswood Campus, Sydney, NSW 2751, Australia
Keywords
Spark; Anonymization; Big data; k-Anonymity; MapReduce; Sensitivity; SQL spark
DOI
10.1186/s40537-018-0149-0
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Privacy is one of the biggest concerns in big data and analytics. We believe that forthcoming frameworks and theories will establish several solutions for privacy protection. One known solution is k-anonymity, which was introduced for traditional data. Recently, two major frameworks, MapReduce and Spark, have been leveraged for big data processing and applications. Spark has been attracting more attention because of its impact on a wide range of big data applications, among which data analytics and anonymization are predominant. We previously proposed an anonymization method that implements k-anonymity in the MapReduce processing framework. In this paper, we investigate Spark's performance in processing data anonymization. Spark is a fast processing framework that has been applied in several areas, such as SQL, multimedia, and data streams. Our focus is Spark SQL, which is adequate for big data anonymization. Since Spark operates in memory, we need to observe its limitations, speed, and fault tolerance as data size increases, and to compare MapReduce with Spark in processing anonymity. Spark introduces an abstraction called resilient distributed datasets (RDDs), which reads and serializes a collection of objects partitioned across a set of machines. Its developers claim that Spark can outperform MapReduce by 10 times in iterative machine learning jobs. Our experiments compare MapReduce and Spark. The overall results show better processing time for Spark in anonymity operations. However, in some limited cases, namely when cluster resources are limited and the network is non-congested, we prefer the older MapReduce framework.
Pages: 26
Related papers
50 in total
  • [1] Sensitivity-based Anonymization of Big Data
    Al-Zobbi, Mohammed
    Shahrestani, Seyed
    Ruan, Chun
    [J]. PROCEEDINGS OF THE 2016 IEEE 41ST CONFERENCE ON LOCAL COMPUTER NETWORKS - LCN WORKSHOPS 2016, 2016, : 58 - 64
  • [2] Towards Optimal Sensitivity-Based Anonymization for Big Data
    Al-Zobbi, Mohammed
    Shahrestani, Seyed
    Ruan, Chun
    [J]. 2017 27TH INTERNATIONAL TELECOMMUNICATION NETWORKS AND APPLICATIONS CONFERENCE (ITNAC), 2017, : 331 - 336
  • [3] Improving MapReduce privacy by implementing multi-dimensional sensitivity-based anonymization
    Al-Zobbi M.
    Shahrestani S.
    Ruan C.
    [J]. Al-Zobbi, Mohammed (m.alzobbi@westernsydney.edu.au), 2017, SpringerOpen (04)
  • [4] A Top-Down k-Anonymization Implementation for Apache Spark
    Sopaoglu, Ugur
    Abul, Osman
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 4513 - 4521
  • [5] Apache Flink and clustering-based framework for fast anonymization of IoT stream data
    Sadeghi-Nasab, Alireza
    Ghaffarian, Hossein
    Rahmani, Mohsen
    [J]. INTELLIGENT SYSTEMS WITH APPLICATIONS, 2023, 20
  • [6] A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark
    Bazai, Sibghat Ullah
    Jang-Jaccard, Julian
    Alavizadeh, Hooman
    [J]. ACM TRANSACTIONS ON PRIVACY AND SECURITY, 2022, 25 (01)
  • [7] AXS: A Framework for Fast Astronomical Data Processing Based on Apache Spark
    Zecevic, Petar
    Slater, Colin T.
    Juric, Mario
    Connolly, Andrew J.
    Loncaric, Sven
    Bellm, Eric C.
    Golkhou, V. Zach
    Suberlak, Krzysztof
    [J]. ASTRONOMICAL JOURNAL, 2019, 158 (01)
  • [8] Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache Spark
    Bazai, Sibghat Ullah
    Jang-Jaccard, Julian
    Alavizadeh, Hooman
    [J]. ELECTRONICS, 2021, 10 (05) : 1 - 28
  • [9] A Dynamic Resource Allocation Framework for Apache Spark Applications
    Wang, Kewen
    Khan, Mohammad Maifi Hasan
    Nguyen, Nhan
    [J]. 2020 IEEE 44TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2020), 2020, : 997 - 1004
  • [10] HetSpark: A Framework that Provides Heterogeneous Executors to Apache Spark
    Hidri, Klodjan Klodi
    Bilas, Angelos
    Kozanitis, Christos
    [J]. 7TH INTERNATIONAL YOUNG SCIENTISTS CONFERENCE ON COMPUTATIONAL SCIENCE, YSC2018, 2018, 136 : 118 - 127