SparkNN: A distributed in-memory data partitioning for KNN queries on big spatial data

Cited by: 0

Authors:

Al Aghbari, Zaher ^{[1]}

Ismail, Tasneem ^{[1]}

Kamel, Ibrahim ^{[2]}

Affiliations:

[1] Department of Computer Science, University of Sharjah, United Arab Emirates

[2] Department of Computer Engineering, University of Sharjah, United Arab Emirates

Source:

Data Science Journal | 2020 / Vol. 19 / Issue 01

Keywords:

Data handling - Motion compensation - Telecommunication services - Location based services - Nearest neighbor search - Trees (mathematics) - Encoding (symbols) - Big data - Indexing (of information) - Location

DOI:

10.5334/dsj-2020-035

Abstract:

The increase in GPS-enabled devices and the proliferation of location-based applications have resulted in an abundance of geotagged (spatial) data. As a consequence, numerous applications have emerged that use this spatial data to provide different types of location-based services. However, the huge amount of available spatial data presents a challenge to the efficiency of these services. Although big data frameworks such as Apache Spark have enabled efficient processing of large amounts of data, they are designed for general (non-spatial) data: their built-in data partitioning mechanism does not take into account the spatial proximity of the data. Therefore, these frameworks cannot be readily used for spatial analytics such as efficiently answering spatial queries. To fill this gap, this paper proposes SparkNN, an in-memory partitioning and indexing system for answering spatial queries, such as K-nearest neighbor (KNN), on big spatial data. SparkNN is implemented on top of Apache Spark and consists of three layers that facilitate efficient spatial queries. The first layer is a spatial-aware partitioning layer, which divides the spatial data into partitions such that the load is balanced across partitions and data objects in close proximity are placed in the same, or neighboring, partitions. The second layer is a local indexing layer, which provides a spatial index inside each partition to speed up search within the partition. The third layer is a global index, which is placed in the master node of Spark and routes spatial queries to the relevant partitions. The efficiency of SparkNN was evaluated through extensive experiments with big spatial datasets. The results show that SparkNN significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries. © 2020 The Author(s).
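
The three-layer idea in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation and does not use Apache Spark: it assumes a uniform grid as the partitioning scheme (which keeps nearby points together but does not balance load), a brute-force scan in place of the per-partition spatial index, and a dictionary of bounding boxes in place of the global index. The names `CELL`, `cell_of`, `build_partitions`, `build_global_index`, `min_dist`, `local_knn`, and `knn` are all hypothetical.

```python
import heapq
import math
from collections import defaultdict

CELL = 10.0  # hypothetical grid cell width; not a value from the paper


def cell_of(p):
    """Layer 1 (partitioning): map a point to a grid cell so nearby points share a partition."""
    return (int(p[0] // CELL), int(p[1] // CELL))


def build_partitions(points):
    """Group points by cell; each group stands in for one partition."""
    parts = defaultdict(list)
    for p in points:
        parts[cell_of(p)].append(p)
    return parts


def build_global_index(parts):
    """Layer 3 (global index): cell id -> bounding box (xmin, ymin, xmax, ymax)."""
    return {
        c: (c[0] * CELL, c[1] * CELL, (c[0] + 1) * CELL, (c[1] + 1) * CELL)
        for c in parts
    }


def min_dist(q, box):
    """Smallest possible distance from q to any point inside box."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)


def local_knn(q, pts, k):
    """Layer 2 (local search): brute-force scan standing in for a per-partition index."""
    return heapq.nsmallest(k, pts, key=lambda p: math.dist(q, p))


def knn(q, k, parts, gindex):
    """Visit partitions nearest-first; stop once no remaining partition can improve the result."""
    order = sorted(gindex, key=lambda c: min_dist(q, gindex[c]))
    best = []  # current best candidates, sorted by distance to q
    for c in order:
        if len(best) == k and min_dist(q, gindex[c]) > math.dist(q, best[-1]):
            break  # every remaining partition is farther than the k-th neighbor
        candidates = best + local_knn(q, parts[c], k)
        best = heapq.nsmallest(k, candidates, key=lambda p: math.dist(q, p))
    return best
```

Calling `knn((42.5, 17.25), 5, parts, gindex)` after building `parts` and `gindex` from a list of `(x, y)` tuples returns the five nearest points, sorted by distance. The pruning test is what the global index contributes in this sketch: partitions whose bounding box lies farther away than the current k-th neighbor are never scanned.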
