Efficient Big Data Modelling and Organization for Hadoop Hive-Based Data Warehouses

Cited by: 10
Authors
Costa, Eduarda [1]
Costa, Carlos [1]
Santos, Maribel Yasmina [1]
Affiliations
[1] Univ Minho, ALGORITMI Res Ctr, P-4800058 Guimaraes, Portugal
Keywords
Big Data; Data Warehousing; Hive; Modelling; Partitioning
DOI
10.1007/978-3-319-65930-5_1
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The amount of data has increased exponentially as a consequence of the availability of new data sources and the advances in data collection and storage. This data explosion was accompanied by the popularization of the Big Data term, addressing large volumes of data, with several degrees of complexity, often without structure and organization, which cannot be processed or analyzed using traditional processes or tools. Moving towards Big Data Warehouses (BDWs) brings new problems and implies the adoption of new logical data models and tools to query them. Hive is a DW system for Big Data contexts that organizes the data into tables, partitions and buckets. Several studies have been conducted to understand ways of optimizing its performance in data storage and processing, but few of them explore whether the way data is structured has any influence on how quickly Hive responds to queries. This paper investigates the role of data organization and modelling in the processing times of BDWs implemented in Hive, benchmarking multidimensional star schemas and fully denormalized tables with different Scale Factors (SFs), and analyzing the impact of adequate data partitioning in these two data modelling strategies.
Pages: 3-16
Number of pages: 14