Experimenting sensitivity-based anonymization framework in Apache Spark

Cited by: 7

Authors:

Al-Zobbi, Mohammed [1]

Shahrestani, Seyed [1]

Ruan, Chun [1]

Affiliations:

[1] Western Sydney Univ, Sch Comp Engn & Math, Locked Bag 1797,Kingswood Campus, Sydney, NSW 2751, Australia

Source:

JOURNAL OF BIG DATA | 2018 / Vol. 5 / Issue 01

Keywords:

Spark; Anonymization; Big data; k-Anonymity; MapReduce; Sensitivity; SQL spark;

DOI:

10.1186/s40537-018-0149-0

Chinese Library Classification number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

One of the biggest concerns of big data and analytics is privacy. We believe that forthcoming frameworks and theories will establish several solutions for privacy protection. One known solution is k-anonymity, which was introduced for traditional data. Recently, two major frameworks have been leveraged for big data processing and applications: MapReduce and Spark. Spark data processing has been attracting more attention due to its impact on a wide range of big data applications. One of the predominant big data applications is data analytics and anonymization. We previously proposed an anonymization method for implementing k-anonymity in the MapReduce processing framework. In this paper, we investigate Spark's performance in processing data anonymization. Spark is a fast processing framework that has been implemented in several applications, such as SQL, multimedia, and data streams. Our focus is SQL Spark, which is adequate for big data anonymization. Since Spark operates in-memory, we need to observe its limitations, speed, and fault tolerance as data size increases, and to compare MapReduce to Spark in processing anonymity. Spark introduces an abstraction called resilient distributed datasets, which reads and serializes a collection of objects partitioned across a set of machines. Developers claim that Spark can outperform MapReduce by 10 times in iterative machine learning jobs. Our experiments in this paper compare MapReduce and Spark. The overall results show better processing time for Spark in anonymity operations. However, in some limited cases, we prefer the older MapReduce framework, namely when cluster resources are limited and the network is non-congested.


Pages: 26

