Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses

Cited by: 10
Authors
Costa, Eduarda [1]
Costa, Carlos [1]
Santos, Maribel Yasmina [1]
Affiliations
[1] Univ Minho, ALGORITMI Res Ctr, P-4800058 Guimaraes, Portugal
Source
Keywords
Big Data; Data Warehousing; Hive; Modelling; Partitioning;
DOI
10.1007/978-3-319-65930-5_1
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The amount of data has increased exponentially as a consequence of the availability of new data sources and the advances in data collection and storage. This data explosion was accompanied by the popularization of the Big Data term, addressing large volumes of data, with several degrees of complexity, often without structure and organization, which cannot be processed or analyzed using traditional processes or tools. Moving towards Big Data Warehouses (BDWs) brings new problems and implies the adoption of new logical data models and tools to query them. Hive is a DW system for Big Data contexts that organizes the data into tables, partitions and buckets. Several studies have been conducted to understand ways of optimizing its performance in data storage and processing, but few of them explore whether the way data is structured has any influence on how quickly Hive responds to queries. This paper investigates the role of data organization and modelling in the processing times of BDWs implemented in Hive, benchmarking multidimensional star schemas and fully denormalized tables with different Scale Factors (SFs), and analyzing the impact of adequate data partitioning in these two data modelling strategies.
Pages: 3 - 16
Number of pages: 14
Related papers
50 in total
  • [1] Hive-Based Anomaly Detection in Hadoop Log Data Management
    Son, Siwoon
    Gil, Myeong-Seon
    Yang, Seokwoo
    Moon, Yang-Sae
    [J]. ADVANCES IN COMPUTER SCIENCE AND UBIQUITOUS COMPUTING, 2017, 421 : 837 - 842
  • [2] Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems
    Costa, Eduarda
    Costa, Carlos
    Santos, Maribel Yasmina
    [J]. JOURNAL OF BIG DATA, 2019, 6 (01)
  • [3] Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems
    Eduarda Costa
    Carlos Costa
    Maribel Yasmina Santos
    [J]. Journal of Big Data, 6
  • [4] SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop
    Ramdane, Yassine
    Kabachi, Nadia
    Boussaid, Omar
    Bentayeb, Fadila
    [J]. BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY, DAWAK 2019, 2019, 11708 : 189 - 205
  • [5] Optimization of Multiple Queries for Big Data with Apache Hadoop/Hive
    Garg, Varun
    [J]. 2015 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMMUNICATION NETWORKS (CICN), 2015 : 938 - 941
  • [6] SkipSJoin: A New Physical Design for Distributed Big Data Warehouses in Hadoop
    Ramdane, Yassine
    Kabachi, Nadia
    Boussaid, Omar
    Bentayeb, Fadila
    [J]. CONCEPTUAL MODELING, ER 2019, 2019, 11788 : 255 - 263
  • [7] Efficient Big Data Processing in Hadoop MapReduce
    Dittrich, Jens
    Quiane-Ruiz, Jorge-Arnulfo
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2012, 5 (12) : 2014 - 2015
  • [8] Towards Efficient Big Data: Hadoop Data Placing and Processing
    Bahadi, Jihane
    El Asri, Bouchra
    Courtine, Melanie
    Rhanoui, Maryem
    Kergosien, Yannick
    [J]. 2ND INTERNATIONAL CONFERENCE ON SMART DIGITAL ENVIRONMENT (ICSDE'18), 2018 : 42 - 47
  • [9] Building Data Warehouses in the Era of Big Data: An Approach for Scalable and Flexible Big Data Warehouses
    Costa, Carlos
    Santos, Maribel Yasmina
    [J]. ADVANCED INFORMATION SYSTEMS ENGINEERING (CAISE 2019), 2019, 11483 : 693 - 695
  • [10] Augmenting Data Warehouses with Big Data
    Jukic, Nenad
    Sharma, Abhishek
    Nestorov, Svetlozar
    Jukic, Boris
    [J]. INFORMATION SYSTEMS MANAGEMENT, 2015, 32 (03) : 200 - 209