Meshing streaming updates with persistent data in an active data warehouse

Cited by: 44
Authors
Polyzotis, Neoklis [1 ]
Skiadopoulos, Spiros [2 ]
Vassiliadis, Panos [3 ]
Simitsis, Alkis [4 ]
Frantzell, Nils-Erik [5 ]
Affiliations
[1] Univ Calif Santa Cruz, Dept Comp Sci, Santa Cruz, CA 95064 USA
[2] Univ Peloponnese, Dept Comp Sci & Technol, Tripoli 22100, Hellas, Greece
[3] Univ Ioannina, Dept Comp Sci, GR-45110 Ioannina, Hellas, Greece
[4] IBM Corp, Almaden Res Ctr, Adv Data Serv, San Jose, CA 95120 USA
[5] Microsoft Corp, Redmond, WA 98052 USA
Keywords
active data warehouse; join; MESHJOIN; streams; relations;
DOI
10.1109/TKDE.2008.27
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Active Data Warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.
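The two ideas named in the abstract (sequential scans of R only, and one scan shared by many tuples of S) can be illustrated with a small in-memory sketch. This is not the authors' implementation: the function name `mesh_join`, the parameters `batch_size`, `key_s`, and `key_r`, and the representation of R as a list of in-memory "pages" are all assumptions made for illustration, and the paper's cost model for tuning is not reproduced. The sketch assumes an equi-join and that each batch of stream tuples stays resident until the cyclic scan has passed every page of R once.

```python
from collections import deque
from itertools import islice


def mesh_join(stream, relation_pages, batch_size, key_s, key_r):
    """Yield (s, r) pairs with key_s(s) == key_r(r).

    Illustrative sketch only: relation_pages stands in for the disk-based
    relation R, read page by page in a cyclic sequential scan.
    """
    k = len(relation_pages)
    if k == 0:
        return
    stream = iter(stream)
    table = {}        # join key -> deque of resident stream tuples (oldest first)
    queue = deque()   # resident batches of stream tuples, oldest first
    page_no = 0
    while True:
        batch = list(islice(stream, batch_size))
        # Stop once the stream is exhausted and no resident tuple remains.
        if not batch and not any(queue):
            break
        # A batch that has been resident for k iterations has met every page.
        if len(queue) == k:
            for s in queue.popleft():
                bucket = table[key_s(s)]
                bucket.popleft()  # expiry follows arrival order within a key
                if not bucket:
                    del table[key_s(s)]
        # Admit the new batch (possibly empty while draining).
        queue.append(batch)
        for s in batch:
            table.setdefault(key_s(s), deque()).append(s)
        # One page read is shared by all resident stream tuples.
        for r in relation_pages[page_no]:
            for s in table.get(key_r(r), ()):
                yield (s, r)
        page_no = (page_no + 1) % k
```

In this sketch memory use is bounded by roughly `batch_size` times the number of pages, which loosely mirrors the memory versus throughput trade-off that the abstract says the cost model is meant to tune.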
Pages: 976-991
Page count: 16
Related papers
50 in total
  • [1] Supporting streaming updates in an active data warehouse
    Polyzotis, Neoklis
    Skiadopoulos, Spiros
    Vassiliadis, Panos
    Simitsis, Alkis
    Frantzell, Nils-Erik
    2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2007, : 451 - +
  • [2] A Partition-based Approach to Support Streaming Updates over Persistent Data in an Active Data Warehouse
    Chakraborty, Abhirup
    Singh, Ajit
    2009 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-5, 2009, : 907 - 917
  • [3] The TargetMine Data Warehouse: Enhancement and Updates
    Chen, Yi-An
    Tripathi, Lokesh P.
    Fujiwara, Takeshi
    Kameyama, Tatsuya
    Itoh, Mari N.
    Mizuguchi, Kenji
    FRONTIERS IN GENETICS, 2019, 10
  • [4] Data warehouse maintenance under concurrent scheme and data updates
    Zhang, X
    Rundensteiner, EA
    15TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 1999, : 253 - 253
  • [5] Incremental updates using Data Warehouse versus Data Marts
    Chakraborty, Sonali
    Doshi, Jyotika
    2018 4TH INTERNATIONAL CONFERENCE FOR CONVERGENCE IN TECHNOLOGY (I2CT), 2018,
  • [6] Persistent Homology on Streaming Data
    Moitra, Anindya
    Malott, Nicholas O.
    Wilsey, Philip A.
    20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2020), 2020, : 636 - 643
  • [7] Scalable Scheduling of Updates in Streaming Data Warehouses
    Golab, Lukasz
    Johnson, Theodore
    Shkapenyuk, Vladislav
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2012, 24 (06) : 1092 - 1105
  • [8] Computation of persistent homology on streaming data using topological data summaries
    Moitra, Anindya
    Malott, Nicholas O. O.
    Wilsey, Philip A. A.
    COMPUTATIONAL INTELLIGENCE, 2023, 39 (05) : 860 - 899
  • [9] Active Data Warehouse: Review, Challenges and Issues
    Hajlaoui, Jalel Eddine
    Hamdani, Nesrine
    2014 WORLD SYMPOSIUM ON COMPUTER APPLICATIONS & RESEARCH (WSCAR), 2014,
  • [10] Active Learning with Evolving Streaming Data
    Zliobaite, Indre
    Bifet, Albert
    Pfahringer, Bernhard
    Holmes, Geoff
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, PT III, 2011, 6913 : 597 - 612