Efficient processing of streaming updates with archived master data in near-real-time data warehousing

Cited by: 2
Authors
Naeem, M. Asif [1]
Dobbie, Gillian [2]
Weber, Gerald [2]
Affiliations
[1] Auckland Univ Technol, Sch Comp & Math Sci, Auckland, New Zealand
[2] Univ Auckland, Dept Comp Sci, Auckland 1, New Zealand
Keywords
Near-real-time data warehousing; Stream-based join; Data transformation; Performance and tuning; JOIN
DOI
10.1007/s10115-013-0653-7
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, such as a stream of products sold, where certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper, we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose, we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN, taking several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be found in non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
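The intuition behind keeping part of the master data in memory under skew can be illustrated with a small sketch. This is not MESHJOIN or X-HYBRIDJOIN as published: it is a simplified per-tuple lookup join, and every name in it (`make_skewed_stream`, `DiskMaster`, `pin_hot_rows`, `join_stream`) is hypothetical. The "disk" is a dictionary that only counts lookups, and the skew is assumed to be Zipf-like.

```python
import random
from collections import Counter


def make_skewed_stream(num_keys, length, seed=0):
    """Draw keys with probability proportional to 1/rank (assumed Zipf-like skew)."""
    rng = random.Random(seed)
    keys = list(range(num_keys))
    weights = [1.0 / (rank + 1) for rank in keys]
    return rng.choices(keys, weights=weights, k=length)


class DiskMaster:
    """Stand-in for disk-based master data; it only counts how often it is read."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def lookup(self, key):
        self.reads += 1
        return self._rows[key]


def pin_hot_rows(sample, rows, capacity):
    """Pick the most frequent keys in a stream sample and copy their rows to memory."""
    hot_keys = [key for key, _ in Counter(sample).most_common(capacity)]
    return {key: rows[key] for key in hot_keys}


def join_stream(stream, disk, pinned=None):
    """Join each stream key with its master row, preferring the pinned in-memory rows."""
    pinned = pinned or {}
    joined = []
    for key in stream:
        row = pinned[key] if key in pinned else disk.lookup(key)
        joined.append((key, row))
    return joined
```

With a skewed stream, pinning a small fraction of the master rows should remove a disproportionate share of the disk lookups, while the join output stays the same. Under a uniform stream the same cache would save far fewer lookups, which is the effect the abstract attributes to exploiting data skew.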
Pages: 615-637
Number of pages: 23
Related papers
50 in total
  • [1] Efficient processing of streaming updates with archived master data in near-real-time data warehousing
    M. Asif Naeem
    Gillian Dobbie
    Gerald Weber
    [J]. Knowledge and Information Systems, 2014, 40 : 615 - 637
  • [2] HYBRIDJOIN for Near-Real-Time Data Warehousing
    Naeem, M. Asif
    Dobbie, Gillian
    Weber, Gerald
    [J]. INTERNATIONAL JOURNAL OF DATA WAREHOUSING AND MINING, 2011, 7 (04) : 21 - 42
  • [3] Efficient Usage of Memory Resources in Near-Real-Time Data Warehousing
    Naeem, Muhammad Asif
    Dobbie, Gillian
    Weber, Gerald
    Bajwa, Imran Sarwar
    [J]. EMERGING TRENDS AND APPLICATIONS IN INFORMATION COMMUNICATION TECHNOLOGIES, 2012, 281 : 326 - +
  • [4] X-HYBRIDJOIN for Near-Real-Time Data Warehousing
    Naeem, Muhammad Asif
    Dobbie, Gillian
    Weber, Gerald
    [J]. ADVANCES IN DATABASES, 2011, 7051 : 33 - 47
  • [5] RECOMMENDED STANDARD FOR WAVE DATA SAMPLING AND NEAR-REAL-TIME PROCESSING
    TUCKER, MJ
    [J]. OCEAN ENGINEERING, 1993, 20 (05) : 459 - 474
  • [6] Combining neural networks for the near-real-time processing of satellite data
    Loyola, DG
    [J]. 2002 FIRST INTERNATIONAL IEEE SYMPOSIUM INTELLIGENT SYSTEMS, VOL 1, PROCEEDINGS, 2002, : 233 - 237
  • [7] Near-real-time applications of CloudSat Data
    Mitrescu, Cristian
    Miller, Steven
    Hawkins, Jeffrey
    L'Ecuyer, Tristan
    Turk, Joseph
    Partain, Philip
    Stephens, Graeme
    [J]. JOURNAL OF APPLIED METEOROLOGY AND CLIMATOLOGY, 2008, 47 (07) : 1982 - 1994
  • [8] An introduction to the near-real-time QuikSCAT data
    Hoffman, RN
    Leidner, SM
    [J]. WEATHER AND FORECASTING, 2005, 20 (04) : 476 - 493
  • [9] TinyLFU-based semi-stream cache join for near-real-time data warehousing
    Naeem, M. Asif
    Waqar, Wasiullah
    Mirza, Farhaan
    Tahir, Ali
    [J]. SOFT COMPUTING, 2022, 26 (20) : 11091 - 11103