SparkNN: A distributed in-memory data partitioning for KNN queries on big spatial data

Authors
Al Aghbari Z. [1]
Ismail T. [1]
Kamel I. [2]
Affiliations
[1] Department of Computer Science, University of Sharjah
[2] Department of Computer Engineering, University of Sharjah
Keywords
Apache Spark; Bounding boxes; Global index; K-nearest neighbors; Kd-tree; Local index; Partitioning; Spatial data
DOI
10.5334/dsj-2020-035
Abstract
The increase in GPS-enabled devices and the proliferation of location-based applications have resulted in an abundance of geotagged (spatial) data. As a consequence, numerous applications have emerged that use this spatial data to provide different types of location-based services. However, the huge amount of available spatial data presents a challenge to the efficiency of these services. Although big data frameworks like Apache Spark have enabled efficient processing of large amounts of data, they are designed for general (non-spatial) data: their built-in data partitioning mechanism does not take into account the spatial proximity of the data. Therefore, these frameworks cannot be readily used for spatial analytics such as efficiently answering spatial queries. To fill this gap, this paper proposes SparkNN, an in-memory partitioning and indexing system for answering spatial queries, such as K-nearest neighbor, on big spatial data. SparkNN is implemented on top of Apache Spark and consists of three layers that facilitate efficient spatial queries. The first layer is a spatial-aware partitioning layer, which divides the spatial data into several partitions, ensuring that the load of the partitions is balanced and that data objects in close proximity are placed in the same, or neighboring, partitions. The second layer is a local indexing layer, which provides a spatial index inside each partition to speed up the search within the partition. The third layer is a global index, which is placed in the master node of Spark to route spatial queries to the relevant partitions. The efficiency of SparkNN was evaluated through extensive experiments with big spatial datasets. The results show that SparkNN significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries. © 2020 The Author(s).
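The three-layer idea in the abstract can be illustrated with a small single-machine sketch. This is not the authors' implementation and it does not use Spark; it only assumes a kd-tree-style median split for partitioning, one bounding box per partition as a stand-in for the global index, and a plain linear scan in place of the local index. All function names (`kd_partition`, `bounding_box`, `min_dist`, `knn`) are hypothetical and chosen for illustration.

```python
import heapq
import math
import random

def kd_partition(points, max_size, depth=0):
    """Split points at the median, alternating x and y, until every
    partition holds at most max_size points (balanced by construction)."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if len(points) <= max_size:
        return [points]
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (kd_partition(pts[:mid], max_size, depth + 1)
            + kd_partition(pts[mid:], max_size, depth + 1))

def bounding_box(points):
    """Axis-aligned box (min_x, min_y, max_x, max_y) of a non-empty partition."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def min_dist(q, box):
    """Smallest Euclidean distance from query q to any point inside box."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)

def knn(query, k, partitions, boxes):
    """Visit partitions in order of box distance and stop once no remaining
    box can contain a point closer than the current k-th neighbor."""
    order = sorted(range(len(partitions)),
                   key=lambda i: min_dist(query, boxes[i]))
    best = []  # max-heap emulated with negated distances: (-dist, point)
    for i in order:
        if len(best) == k and min_dist(query, boxes[i]) > -best[0][0]:
            break  # every later partition is at least this far away
        for p in partitions[i]:  # linear scan stands in for a local index
            d = math.hypot(query[0] - p[0], query[1] - p[1])
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return sorted((-neg_d, p) for neg_d, p in best)

# Synthetic data, used here only to exercise the sketch.
random.seed(0)
data = [(random.random(), random.random()) for _ in range(1000)]
partitions = kd_partition(data, max_size=64)
boxes = [bounding_box(part) for part in partitions]
result = knn((0.5, 0.5), 5, partitions, boxes)
```

In a distributed setting the list of boxes would presumably live on the driver and each partition on a worker, so the pruning step decides which workers a query must touch; how SparkNN actually realizes this is described in the paper itself, not in this sketch.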
Pages: 1 / 14
Page count: 13