Importance of Data Distribution on Hive-based Systems for Query Performance: An Experimental Study

Cited by: 1
Authors
Ciritoglu, Hilmi Egemen [1]
Murphy, John [1]
Thorpe, Christina [1, 2]
Affiliations
[1] Univ Coll Dublin, Sch Comp Sci, Performance Engn Lab, Dublin, Ireland
[2] Technol Univ Dublin, Dublin, Ireland
Funding
Science Foundation Ireland;
Keywords
SQL-on-Hadoop; Hadoop; HDFS; Data distribution; Software Performance;
DOI
10.1109/BigComp48618.2020.00-47
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
SQL-on-Hadoop systems have been gaining popularity in recent years. One popular example is Apache Hive, the pioneer of SQL-on-Hadoop systems. Hive sits at the top of the big data stack as an application layer. Besides the application layer, the Hadoop ecosystem is composed of three main layers: storage, the resource manager, and the processing engine. Demand from industry has led to the development of new, efficient components for each layer, and as the ecosystem has evolved, Hive has employed different execution engines too. Understanding the strengths of each component is essential to exploiting the full performance of the Hadoop ecosystem, and recent works in the literature therefore study the importance of each layer separately. To the best of our knowledge, the present work is the first to focus on the performance of the storage layer and the execution engine in combination. We compare Hive's query performance under three execution engines (MapReduce (MR), Tez, and Spark) on skewed and well-balanced data distributions, using the full TPC-H benchmark. Our results show the importance of data distribution in the storage layer for the overall job performance of SQL-on-Hadoop systems, and show empirically that an even distribution improves performance by up to 48% compared to a skewed distribution. Moreover, the study identifies particular SQL query cases that a given processing engine handles exceptionally well.
Pages: 370-376
Number of pages: 7