Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

Cited by: 2
Authors
Li, Jiajun [1]
Wei, Zhewei [1]
Ding, Bolin [2]
Dai, Xiening [2]
Lu, Lu [2]
Zhou, Jingren [2]
Affiliations
[1] Renmin Univ China, Beijing, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
sampling; distributed environment; NDV; communication; complexity
DOI
10.1145/3534678.3539390
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating the NDV can be broadly classified into two categories: (i) scanning-based methods, which scan the entire dataset and maintain a sketch to approximate the NDV; and (ii) sampling-based methods, which estimate the NDV from sampled data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Because of its higher scalability, sampling-based estimation is preferable in applications with large data volumes where some error is permissible. However, while the sampling-based method is more effective on a single machine, it is less practical in a distributed environment with massive data volumes. To obtain the final NDV estimate, the entire sample must be transferred across the distributed system, incurring a prohibitive communication cost when the sampling rate is high. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method leverages a sketch-based algorithm to estimate the sample's frequency of frequencies in the distributed streaming model, which is compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.
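To make the role of the frequency of frequencies concrete, the following is a minimal single-machine sketch in Python, not the paper's distributed algorithm. It computes f_i (the number of distinct values that occur exactly i times in the sample) and feeds it to two classical sampling-based estimators, GEE and Chao. These two estimators are chosen here only for illustration, since the abstract does not name specific ones. The function names and the toy sample are hypothetical.

```python
from collections import Counter
from math import sqrt


def frequency_of_frequencies(sample):
    """Return {i: f_i}, where f_i counts distinct values seen exactly i times."""
    value_counts = Counter(sample)               # value -> occurrences in sample
    return dict(Counter(value_counts.values()))  # occurrences -> number of values


def gee_estimate(freq_of_freq, population_size, sample_size):
    """GEE estimator: sqrt(N / n) * f_1 + sum of f_i for i >= 2."""
    f1 = freq_of_freq.get(1, 0)
    repeated = sum(c for i, c in freq_of_freq.items() if i >= 2)
    return sqrt(population_size / sample_size) * f1 + repeated


def chao_estimate(freq_of_freq):
    """Chao estimator: d + f_1^2 / (2 * f_2), falling back to d when f_2 = 0."""
    d = sum(freq_of_freq.values())  # distinct values observed in the sample
    f1 = freq_of_freq.get(1, 0)
    f2 = freq_of_freq.get(2, 0)
    if f2 == 0:
        return float(d)
    return d + f1 * f1 / (2 * f2)


# Toy sample of 8 rows, assumed to be drawn from a table of 800 rows.
sample = ["a", "b", "a", "c", "d", "d", "e", "a"]
fof = frequency_of_frequencies(sample)
print(fof)
print(gee_estimate(fof, population_size=800, sample_size=8))
print(chao_estimate(fof))
```

Both estimators read only the frequency-of-frequencies vector, never the raw sample, which is why an approximation of that vector is enough to drive estimators of this kind.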
Pages: 893-903
Number of pages: 11