Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses

Cited by: 10
Authors
Costa, Eduarda [1]
Costa, Carlos [1]
Santos, Maribel Yasmina [1]
Affiliations
[1] Univ Minho, ALGORITMI Res Ctr, P-4800058 Guimaraes, Portugal
Source
Keywords
Big Data; Data Warehousing; Hive; Modelling; Partitioning
DOI
10.1007/978-3-319-65930-5_1
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The amount of data has increased exponentially as a consequence of the availability of new data sources and the advances in data collection and storage. This data explosion was accompanied by the popularization of the Big Data term, addressing large volumes of data, with several degrees of complexity, often without structure and organization, which cannot be processed or analyzed using traditional processes or tools. Moving towards Big Data Warehouses (BDWs) brings new problems and implies the adoption of new logical data models and tools to query them. Hive is a DW system for Big Data contexts that organizes the data into tables, partitions and buckets. Several studies have been conducted to understand ways of optimizing its performance in data storage and processing, but few of them explore whether the way data is structured has any influence on how quickly Hive responds to queries. This paper investigates the role of data organization and modelling in the processing times of BDWs implemented in Hive, benchmarking multidimensional star schemas and fully denormalized tables with different Scale Factors (SFs), and analyzing the impact of adequate data partitioning in these two data modelling strategies.
Pages: 3 - 16
Number of pages: 14
Related papers
50 in total
  • [31] Privacy preserving data publishing based on sensitivity in context of Big Data using Hive
    Rao P.S.
    Satyanarayana S.
    [J]. Journal of Big Data, 5 (1)
  • [32] A Novel Clustering Technique for Efficient Clustering of Big Data in Hadoop Ecosystem
    Kumar, Sunil
    Singh, Maninder
    [J]. BIG DATA MINING AND ANALYTICS, 2019, 2 (04) : 240 - 247
  • [33] The Efficient Implementation of Distributed Indexing with Hadoop for Digital Investigations on Big Data
    Lee, Taerim
    Lee, Hyejoo
    Rhee, Kyung-Hyune
    Shin, Sang Uk
    [J]. COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2014, 11 (03) : 1037 - 1054
  • [34] A special issue in extending data warehouses to big data analytics
    Bellatreche, Ladjel
    Chakravarthy, Sharma
    [J]. DISTRIBUTED AND PARALLEL DATABASES, 2019, 37 (03) : 323 - 327
  • [36] Big Data: Performance Profiling of Meteorological and Oceanographic Data on Hive
    Abdullahi, Ali Usman
    Ahmad, Rohiza
    Zakaria, Nordin M.
    [J]. 2016 3RD INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCES (ICCOINS), 2016 : 203 - 208
  • [37] An efficient Hadoop-based brain tumor detection framework using big data analytic
    Kaur Chahal, Prabhjot
    Pandey, Shreelekha
    [J]. SOFTWARE-PRACTICE & EXPERIENCE, 2022, 52 (03) : 805 - 818
  • [38] Modelling and querying geographical data warehouses
    da Silva, Joel
    de Oliveira, Anjolina G.
    Fidalgo, Robson N.
    Salgado, Ana Carolina
    Times, Valeria C.
    [J]. INFORMATION SYSTEMS, 2010, 35 (05) : 592 - 614
  • [39] Big data and Spark: Comparison with Hadoop
    Benlachmi, Yassine
    Hasnaoui, Moulay Lahcen
    [J]. PROCEEDINGS OF THE 2020 FOURTH WORLD CONFERENCE ON SMART TRENDS IN SYSTEMS, SECURITY AND SUSTAINABILITY (WORLDS4 2020), 2020 : 811 - 817
  • [40] Handling Big Data with Hadoop Toolkit
    Devakunchari, R.
    [J]. 2014 INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND EMBEDDED SYSTEMS (ICICES), 2014