SparkNN: A distributed in-memory data partitioning for KNN queries on big spatial data

Cited by: 0

Authors:

Al Aghbari Z. [1]

Ismail T. [1]

Kamel I. [2]

Affiliations:

[1] Department of Computer Science, University of Sharjah

[2] Department of Computer Engineering, University of Sharjah

Source:

Data Science Journal | 2020 / Vol. 19 / Issue 01

Keywords:

Apache Spark; Bounding boxes; Global index; K-nearest neighbors; Kd-tree; Local index; Partitioning; Spatial data

DOI:

10.5334/dsj-2020-035

Abstract:

The increase in GPS-enabled devices and proliferation of location-based applications have resulted in an abundance of geotagged (spatial) data. As a consequence, numerous applications have emerged that utilize the spatial data to provide different types of location-based services. However, the huge amount of available spatial data presents a challenge to the efficiency of these location-based services. Although the advent of big data frameworks like Apache Spark has enabled the processing of large amounts of data efficiently, they are designed for general (non-spatial) data. That is due to the built-in data partitioning mechanism that does not take into account the spatial proximity of the data. Therefore, these big data frameworks cannot be readily used for spatial analytics such as efficiently answering spatial queries. To fill this gap, this paper proposes SparkNN, an in-memory partitioning and indexing system for answering spatial queries, such as K-nearest neighbor, on big spatial data. SparkNN is implemented on top of Apache Spark and consists of three layers to facilitate efficient spatial queries. The first layer is a spatial-aware partitioning layer, which partitions the spatial data into several partitions ensuring that the load of the partitions is balanced and data objects with close proximity are placed in the same, or neighboring, partitions. The second layer is a local indexing layer, which provides a spatial index inside each partition to speed up the data search within the partition. The third layer is a global index, which is placed in the master node of Spark to route spatial queries to the relevant partitions. The efficiency of SparkNN was evaluated by extensive experiments with big spatial datasets. The results show SparkNN significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries. © 2020 The Author(s).
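
The three layers described in the abstract can be illustrated with a small single-machine sketch. This is not the SparkNN implementation: it does not use Apache Spark, every function name below (`kd_partition`, `bounding_box`, `min_dist`, `local_knn`, `knn`) is hypothetical, the median split along the wider axis is only an assumed kd-tree-style strategy suggested by the keywords, and a plain linear scan stands in for the local spatial index.

```python
import heapq
import math


def kd_partition(points, max_size):
    """Layer 1 (assumed strategy): split at the median of the wider axis
    until every partition holds at most max_size points (max_size >= 1)."""
    if len(points) <= max_size:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return kd_partition(pts[:mid], max_size) + kd_partition(pts[mid:], max_size)


def bounding_box(points):
    """Bounding box (min_x, min_y, max_x, max_y) of one partition."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def min_dist(q, box):
    """Smallest possible distance from query q to any point inside box."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)


def local_knn(points, q, k):
    """Layer 2 stand-in: a linear scan instead of a real local spatial index."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))


def knn(partitions, boxes, q, k):
    """Layer 3 stand-in: route the query to partitions in order of box
    distance and stop once no remaining box can contain a closer point."""
    order = sorted(range(len(partitions)), key=lambda i: min_dist(q, boxes[i]))
    best = []  # (distance, point) pairs, kept sorted, at most k of them
    for i in order:
        if len(best) == k and min_dist(q, boxes[i]) > best[-1][0]:
            break  # every remaining partition is at least this far away
        cand = [(math.dist(p, q), p) for p in local_knn(partitions[i], q, k)]
        best = sorted(best + cand)[:k]
    return [p for _, p in best]


# Usage: partition a 20 x 20 grid, build the box list, run one query.
grid = [(float(x), float(y)) for x in range(20) for y in range(20)]
parts = kd_partition(grid, 25)
boxes = [bounding_box(p) for p in parts]
nearest = knn(parts, boxes, (3.2, 7.6), 5)
```

In this sketch the list of bounding boxes plays the role of the global index: only partitions whose box could still hold a closer point are searched, which is the routing idea the abstract attributes to the third layer.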

Pages: 1 / 14

Number of pages: 13
