Importance of Data Distribution on Hive-based Systems for Query Performance: An Experimental Study

Cited by: 1

Authors:

Ciritoglu, Hilmi Egemen [1]

Murphy, John [1]

Thorpe, Christina [1,2]

Affiliations:

[1] Univ Coll Dublin, Sch Comp Sci, Performance Engn Lab, Dublin, Ireland

[2] Technol Univ Dublin, Dublin, Ireland

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2020) | 2020

Funding:

Science Foundation Ireland

Keywords:

SQL-on-Hadoop; Hadoop; HDFS; Data distribution; Software Performance;

DOI:

10.1109/BigComp48618.2020.00-47

Chinese Library Classification number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

SQL-on-Hadoop systems have been gaining popularity in recent years. One popular example is Apache Hive, the pioneer of SQL-on-Hadoop systems. Hive sits at the top of the big data stack as the application layer. Besides the application layer, the Hadoop ecosystem is composed of three main layers: storage, the resource manager, and the processing engine. Demand from industry has led to the development of new, efficient components for each layer. As the ecosystem has evolved, Hive has adopted different execution engines as well. Understanding the strengths of each component is essential to exploiting the full performance of the Hadoop ecosystem, and recent works in the literature therefore study the importance of each layer separately. To the best of our knowledge, the present work is the first to focus on the performance of the storage layer and the execution engine in combination. We compare Hive's query performance using three different execution engines (MapReduce (MR), Tez, and Spark) on skewed and well-balanced data distributions through the full TPC-H benchmark. Our results show the importance of data distribution on the storage layer for the overall job performance of SQL-on-Hadoop systems, and empirically show that an even distribution improves performance by up to 48% compared to a skewed distribution. Moreover, the study identifies particular SQL query cases that a certain processing engine handles exceptionally well.

Pages: 370-376

Number of pages: 7
