Dalton: Learned Partitioning for Distributed Data Streams

Cited by: 3
Authors
Zapridou, Eleni [1]
Mytilinis, Ioannis [2]
Ailamaki, Anastasia [1]
Affiliations
[1] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[2] Oracle, Lausanne, Switzerland
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2022 / Vol. 16 / Issue 03
DOI
10.14778/3570690.3570699
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
To sustain the input rate of high-throughput streams, modern stream processing systems rely on parallel execution. However, skewed data yield imbalanced load assignments and create stragglers that hinder scalability. Deciding on a static partitioning for a given set of "hot" keys is not sufficient as these keys are not known in advance, and even worse, the data distribution can change unpredictably. Existing algorithms either optimize for a specific distribution or, in order to adapt, assume a centralized partitioner that processes every incoming tuple and observes the whole workload. However, this is not realistic in a distributed environment, where multiple parallel upstream operators exist, as the centralized partitioner itself becomes the bottleneck and limits scalability. In this work, we propose Dalton: a lightweight, adaptive, yet scalable partitioning operator that relies on reinforcement learning. By memoizing state and dynamically keeping track of recent experience, Dalton: i) adjusts its policy at runtime and quickly adapts to the workload, ii) avoids redundant computations and minimizes the per-tuple partitioning overhead, and iii) efficiently scales out to multiple instances that learn cooperatively and converge to a joint policy. Our experiments indicate that Dalton scales regardless of the input data distribution and sustains 1.3x-6.7x higher throughput than existing approaches.
Pages: 491-504
Page count: 14