Efficient and Customizable Data Partitioning Framework for Distributed Big RDF Data Processing in the Cloud

Cited by: 26
Authors
Lee, Kisung [1]
Liu, Ling [1]
Tang, Yuzhe [1]
Zhang, Qi [1]
Zhou, Yang [1]
Affiliations
[1] Georgia Inst Technol, Coll Comp, DiSL, Atlanta, GA 30332 USA
DOI
10.1109/CLOUD.2013.63
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Big data businesses can leverage and benefit from the cloud, the most optimized, shared, automated, and virtualized computing infrastructure. One important challenge in processing big data in the cloud is how to partition the data effectively to ensure efficient distributed processing. In this paper we present a Scalable and yet customizable data PArtitioning framework, called SPA, for distributed processing of big RDF graph data. We choose big RDF datasets as the focus of our investigation for two reasons. First, the Linking Open Data cloud has put forward a good number of big RDF datasets with tens of billions of triples and hundreds of millions of links. Second, such huge RDF graphs can easily overwhelm any single server because of its limited memory and CPU capacity, and can exceed the processing capacity of many conventional data processing software systems. Our data partitioning framework has two unique features. First, we introduce a suite of vertex-centric data partitioning building blocks to allow efficient and yet customizable partitioning of large heterogeneous RDF graph data. By efficient, we mean that the SPA data partitions can support fast processing of big data of different sizes and complexity. By customizable, we mean that the SPA partitions are adaptive to different query types. Second, we propose a selection of scalable techniques to distribute the building block partitions across a cluster of compute nodes in a manner that minimizes inter-node communication cost by localizing most of the queries on distributed partitions. We evaluate our data partitioning framework and algorithms through extensive experiments using both benchmark and real datasets. Our experimental results show that the SPA data partitioning framework is not only efficient for partitioning and distributing big RDF datasets of diverse sizes and structures but also effective for processing big data queries of different types and complexity.
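The abstract does not specify the SPA building blocks or the distribution techniques, so the following is only a minimal sketch of the general idea of vertex-centric partitioning, under stated assumptions: each block is taken to be a subject vertex plus its outgoing triples, and blocks are placed on nodes by a simple hash of the anchor vertex. The function names `build_vertex_blocks` and `distribute_blocks`, the sample triples, and the hash-based placement are all hypothetical illustrations, not the paper's method.

```python
from collections import defaultdict
from zlib import crc32


def build_vertex_blocks(triples):
    """Group triples by subject vertex (assumed block shape, not SPA's definition).

    Each block holds one anchor vertex together with all of its outgoing edges.
    """
    blocks = defaultdict(list)
    for subject, predicate, obj in triples:
        blocks[subject].append((subject, predicate, obj))
    return dict(blocks)


def distribute_blocks(blocks, num_nodes):
    """Place each whole block on one node using a deterministic hash of its anchor.

    Hash placement is an assumption made for illustration; keeping a block intact
    means a query touching only one subject's triples needs no inter-node traffic.
    """
    partitions = [[] for _ in range(num_nodes)]
    for anchor, block in blocks.items():
        node = crc32(anchor.encode("utf-8")) % num_nodes
        partitions[node].extend(block)
    return partitions


# Hypothetical sample data.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:gatech"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:carol", "ex:worksAt", "ex:gatech"),
]

blocks = build_vertex_blocks(triples)
partitions = distribute_blocks(blocks, num_nodes=2)

for node_id, part in enumerate(partitions):
    print(node_id, part)
```

In this sketch every triple with the same subject ends up on the same node, so a star-shaped query around one subject can be answered locally. Queries that follow paths across several vertices would still cross nodes, which is the kind of cost the abstract says the framework aims to minimize.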
Pages: 327-334
Number of pages: 8