SparkNN: A distributed in-memory data partitioning for KNN queries on big spatial data

Authors
Al Aghbari, Zaher [1]
Ismail, Tasneem [1]
Kamel, Ibrahim [2]
Affiliations
[1] Department of Computer Science, University of Sharjah, United Arab Emirates
[2] Department of Computer Engineering, University of Sharjah, United Arab Emirates
Keywords
Data handling - Motion compensation - Telecommunication services - Location based services - Nearest neighbor search - Trees (mathematics) - Encoding (symbols) - Big data - Indexing (of information) - Location
DOI
10.5334/dsj-2020-035
Abstract
The increase in GPS-enabled devices and the proliferation of location-based applications have resulted in an abundance of geotagged (spatial) data. As a consequence, numerous applications have emerged that use this spatial data to provide different types of location-based services. However, the huge amount of available spatial data presents a challenge to the efficiency of these services. Although big data frameworks such as Apache Spark process large amounts of data efficiently, they are designed for general (non-spatial) data: their built-in data partitioning mechanism does not take the spatial proximity of the data into account. Therefore, these frameworks cannot be readily used for spatial analytics such as efficiently answering spatial queries. To fill this gap, this paper proposes SparkNN, an in-memory partitioning and indexing system for answering spatial queries, such as K-nearest neighbor (KNN), on big spatial data. SparkNN is implemented on top of Apache Spark and consists of three layers that facilitate efficient spatial queries. The first layer is a spatial-aware partitioning layer, which divides the spatial data into several partitions, ensuring that the load of the partitions is balanced and that data objects in close proximity are placed in the same, or neighboring, partitions. The second layer is a local indexing layer, which provides a spatial index inside each partition to speed up the data search within the partition. The third layer is a global index, which is placed in the master node of Spark to route spatial queries to the relevant partitions. The efficiency of SparkNN was evaluated through extensive experiments with big spatial datasets. The results show that SparkNN significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries. © 2020 The Author(s).
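The three-layer idea in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not SparkNN's implementation and does not use Spark. The abstract does not name the partitioning scheme or the index structures, so everything here is an assumption made for illustration: the median-split partitioning, the bounding-box global index, the brute-force scan that stands in for the local index, and all function names (`split_balanced`, `bounding_box`, `min_dist`, `local_knn`, `knn`) are hypothetical.

```python
import heapq
import math
import random

def split_balanced(points, num_partitions):
    """Layer 1 (assumed scheme): recursive median splits on alternating axes,
    so partitions hold nearly equal counts and nearby points stay together."""
    def split(pts, parts, depth):
        if parts == 1:
            return [pts]
        axis = depth % 2
        pts = sorted(pts, key=lambda p: p[axis])
        left_parts = parts // 2
        cut = len(pts) * left_parts // parts
        return (split(pts[:cut], left_parts, depth + 1)
                + split(pts[cut:], parts - left_parts, depth + 1))
    # Drop empty partitions, which can occur when there are very few points.
    return [p for p in split(list(points), num_partitions, 0) if p]

def bounding_box(pts):
    """Axis-aligned box (min_x, min_y, max_x, max_y) of one partition."""
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))

def min_dist(q, box):
    """Smallest possible distance from query q to any point inside box."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)

def local_knn(pts, q, k):
    """Layer 2 stand-in: a brute-force scan where a real system would use
    a per-partition spatial index. Returns (distance, point) pairs."""
    return heapq.nsmallest(k, ((math.dist(q, p), p) for p in pts))

def knn(partitions, global_index, q, k):
    """Layer 3 (assumed routing): visit partitions in order of min_dist and
    stop once no remaining box can hold a closer point than the k-th best."""
    order = sorted((min_dist(q, box), pid) for pid, box in global_index)
    best = []
    for d, pid in order:
        if len(best) == k and d > best[-1][0]:
            break
        best = heapq.nsmallest(k, best + local_knn(partitions[pid], q, k))
    return [p for _, p in best]

# Example usage on synthetic data (the dataset is made up for the demo).
random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]
partitions = split_balanced(points, 8)
global_index = [(pid, bounding_box(pts)) for pid, pts in enumerate(partitions)]
print(knn(partitions, global_index, (0.5, 0.5), 3))
```

In this sketch the routing step only touches partitions whose bounding box could still contain a closer neighbor, which is the role the abstract assigns to the global index. In a distributed setting each partition would live on a worker and the global index on the master node.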
Pages: 1 / 14