Cost-Based Optimization of Logical Partitions for a Query Workload in a Hadoop Data Warehouse

Authors
Peng, Shu [1]
Gu, Jun [1]
Wang, X. Sean [1]
Rao, Weixiong [2]
Yang, Min [1]
Cao, Yu [3]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Comp Sci, Shanghai 200433, Peoples R China
[2] Tongji Univ, Sch Software Engn, Shanghai, Peoples R China
[3] EMC Labs, Beijing, Peoples R China
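Keywords
Hadoop; Data Analysis; Data Partition; Query Workload; Cost-based Optimization; Data Placement; MapReduce
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract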
Hadoop has recently become a common programming framework for big data analysis on clusters of commodity machines. To optimize queries over large amounts of data managed by the Hadoop Distributed File System (HDFS), it is particularly important to optimize how the data is read. Previous work either designed file formats that cluster data belonging to the same column, or proposed placing correlated data on the same physical nodes. In a query-workload-aware setting, a possible optimization strategy is to place data that is unlikely to be used by the same query into different logical partitions, so that a query does not need to read every partition, while physically distributing the data in each partition evenly across the compute nodes. This paper proposes a condition-based partitioning scheme to implement this strategy. Experiments show that the proposed scheme not only reduces I/O cost but also keeps the workload balanced across the compute nodes of the cluster.
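The strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's scheme: the abstract does not give the partitioning algorithm or the cost model, so the sketch simply assumes that each workload query is one selection condition and that records are grouped by the set of conditions they satisfy. The names `WORKLOAD`, `signature`, `build_partitions`, `place_on_nodes`, and `partitions_for_query`, as well as the sample records, are hypothetical.

```python
from collections import defaultdict

# Hypothetical workload: each query is modeled as one named selection condition.
WORKLOAD = {
    "q_recent": lambda r: r["year"] >= 2014,
    "q_asia": lambda r: r["region"] == "asia",
}

def signature(record):
    """Names of the workload conditions that the record satisfies."""
    return frozenset(name for name, cond in WORKLOAD.items() if cond(record))

def build_partitions(records):
    """Group records into logical partitions keyed by condition signature."""
    partitions = defaultdict(list)
    for record in records:
        partitions[signature(record)].append(record)
    return dict(partitions)

def place_on_nodes(partitions, num_nodes):
    """Spread each partition's records round-robin over the compute nodes."""
    placement = {}
    for sig, rows in partitions.items():
        nodes = [[] for _ in range(num_nodes)]
        for i, row in enumerate(rows):
            nodes[i % num_nodes].append(row)
        placement[sig] = nodes
    return placement

def partitions_for_query(partitions, query_name):
    """Logical partitions a query must read; all others can be skipped."""
    return [sig for sig in partitions if query_name in sig]

# Illustrative sample data (assumed, not from the paper).
RECORDS = [
    {"id": 1, "year": 2015, "region": "asia"},
    {"id": 2, "year": 2015, "region": "europe"},
    {"id": 3, "year": 2010, "region": "asia"},
    {"id": 4, "year": 2010, "region": "europe"},
    {"id": 5, "year": 2016, "region": "europe"},
]

PARTITIONS = build_partitions(RECORDS)
PLACEMENT = place_on_nodes(PARTITIONS, num_nodes=2)
```

In this toy setup, calling `partitions_for_query(PARTITIONS, "q_asia")` returns only the partitions whose signature contains `q_asia`, so records matching no condition or only `q_recent` are never read for that query, while `PLACEMENT` keeps every partition spread over both nodes.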
Pages: 559-567
Page count: 9