Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

Cited by: 2
Authors
Li, Jiajun [1]
Wei, Zhewei [1]
Ding, Bolin [2]
Dai, Xiening [2]
Lu, Lu [2]
Zhou, Jingren [2]
Affiliations
[1] Renmin Univ China, Beijing, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
sampling; distributed environment; NDV; communication; complexity
DOI
10.1145/3534678.3539390
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV fall broadly into two categories: i) scanning-based methods, which scan the entire data set and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV from sampled data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation scales better and is therefore preferable in applications with a large data volume and a tolerable error bound. However, while the sampling-based method is effective on a single machine, it is less practical in a distributed environment with massive data volumes. To obtain the final NDV estimate, the entire sample must be transferred across the distributed system, incurring a prohibitive communication cost when the sampling rate is high. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method leverages a sketch-based algorithm to estimate the sample's frequency of frequencies in the distributed streaming model, which is compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.
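As an illustration of the "frequency of frequencies" that the abstract refers to, the following minimal sketch computes that profile from a sample held on a single machine and feeds it to one classical sampling-based estimator, GEE (sqrt(N/n) * f1 + sum of f_j for j >= 2). It does not reproduce the paper's sketch-based distributed algorithm; the choice of GEE, the function names, and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def frequency_of_frequencies(sample):
    """Return f, where f[j] is the number of distinct values seen exactly j times."""
    value_counts = Counter(sample)          # value -> occurrences in the sample
    return Counter(value_counts.values())   # occurrences -> number of such values


def gee_estimate(sample, population_size):
    """GEE estimator: sqrt(N/n) * f1 + sum_{j>=2} f_j (illustrative choice)."""
    n = len(sample)
    if n == 0:
        return 0.0
    f = frequency_of_frequencies(sample)
    singletons = f.get(1, 0)                               # values seen once
    repeated = sum(c for j, c in f.items() if j >= 2)      # values seen 2+ times
    return sqrt(population_size / n) * singletons + repeated


# Hypothetical toy sample of 8 rows drawn from a table of 800 rows.
sample = ["a", "b", "a", "c", "d", "d", "d", "e"]
profile = frequency_of_frequencies(sample)
estimate = gee_estimate(sample, population_size=800)
print(dict(profile), estimate)
```

The estimator only needs the profile f, not the sampled rows themselves, which is why estimating f directly can stand in for shipping the whole sample between machines.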
Pages: 893-903
Number of pages: 11