Efficient processing of streaming updates with archived master data in near-real-time data warehousing

Cited by: 2
Authors
Naeem, M. Asif [1]
Dobbie, Gillian [2]
Weber, Gerald [2]
Affiliations
[1] Auckland Univ Technol, Sch Comp & Math Sci, Auckland, New Zealand
[2] Univ Auckland, Dept Comp Sci, Auckland 1, New Zealand
Keywords
Near-real-time data warehousing; Stream-based join; Data transformation; Performance and tuning; JOIN
DOI
10.1007/s10115-013-0653-7
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, in which the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions are common, such as a stream of products sold in which certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements under typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN, taking several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be found in non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
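The idea in the abstract of keeping the frequently joined part of the master data in memory can be illustrated with a small sketch. This is not the authors' X-HYBRIDJOIN: the class name `SkewAwareJoin`, the `hot_capacity` parameter, and the frequency-based promotion rule are all assumptions made for illustration, the "disk" is simulated by a dictionary, and the sketch does not model the partition-wise disk reads or the amortization over a stream buffer that MESHJOIN-style algorithms rely on.

```python
from collections import Counter
import random


class SkewAwareJoin:
    """Toy stream-to-master-data join with a memory-resident hot set.

    Hypothetical illustration only; every name here is an assumption.
    """

    def __init__(self, master, hot_capacity):
        self.master = master              # dict standing in for disk-based master data
        self.hot_capacity = hot_capacity  # max number of master rows kept in memory
        self.hot = {}                     # memory-resident part of the master data
        self.freq = Counter()             # how often each join key has been seen
        self.disk_reads = 0               # simulated disk accesses

    def _read_from_disk(self, key):
        self.disk_reads += 1
        return self.master.get(key)

    def _maybe_promote(self, key, row):
        if self.hot_capacity == 0:
            return
        if len(self.hot) < self.hot_capacity:
            self.hot[key] = row
            return
        # Replace the least frequently seen hot key only if the new key is hotter.
        coldest = min(self.hot, key=self.freq.__getitem__)
        if self.freq[key] > self.freq[coldest]:
            del self.hot[coldest]
            self.hot[key] = row

    def join(self, stream):
        """Yield (key, stream_payload, master_row) for every matching stream tuple."""
        for key, payload in stream:
            self.freq[key] += 1
            row = self.hot.get(key)
            if row is None:
                row = self._read_from_disk(key)
                if row is None:
                    continue  # no matching master row
                self._maybe_promote(key, row)
            yield key, payload, row


if __name__ == "__main__":
    rng = random.Random(0)
    master = {k: f"product-{k}" for k in range(1000)}
    # Skewed stream: low-numbered products are sold far more often.
    weights = [1.0 / (k + 1) for k in range(1000)]
    keys = rng.choices(range(1000), weights=weights, k=10_000)
    stream = [(k, i) for i, k in enumerate(keys)]

    for capacity in (0, 50):
        joiner = SkewAwareJoin(master, hot_capacity=capacity)
        matched = sum(1 for _ in joiner.join(stream))
        print(f"hot_capacity={capacity}: {matched} matches, {joiner.disk_reads} disk reads")
```

With `hot_capacity=0` every stream tuple triggers a simulated disk read, which serves as a non-adaptive baseline; a non-zero capacity lets the skewed keys be answered from memory.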
Pages: 615-637
Page count: 23