Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

Cited by: 2
Authors
Li, Jiajun [1 ]
Wei, Zhewei [1 ]
Ding, Bolin [2 ]
Dai, Xiening [2 ]
Lu, Lu [2 ]
Zhou, Jingren [2 ]
Affiliations
[1] Renmin Univ China, Beijing, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
sampling; distributed environment; NDV; communication; complexity
DOI
10.1145/3534678.3539390
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV can be broadly classified into two categories: i) scanning-based methods, which scan the entire data and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV from sampled data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation is preferable in applications with a large data volume and a permissible error restriction because of its higher scalability. However, while the sampling-based method is more effective on a single machine, it is less practical in a distributed environment with massive data volumes. To obtain the final NDV estimators, the entire sample must be transferred throughout the distributed system, incurring a prohibitive communication cost when the sample rate is significant. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method leverages a sketch-based algorithm to estimate the sample's frequency of frequency in the distributed streaming model, which is compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.
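The abstract refers to the sample's "frequency of frequency" (the number of values that occur exactly j times in the sample) as the input to classical sampling-based estimators. The Python sketch below illustrates that idea only; it is not the paper's sketch-based method. It shows the naive baseline in which every machine ships its local counts to a coordinator, and it uses GEE (the Guaranteed Error Estimator) as one example of a classical estimator. The choice of GEE and all function names are assumptions made for illustration; the abstract does not name a specific estimator.

```python
from collections import Counter
from math import sqrt


def merge_local_counts(local_samples):
    """Naive baseline (not the paper's method): combine per-machine value counts.

    Communication here grows with the number of sampled values, which is the
    cost the abstract describes as prohibitive at high sample rates.
    """
    total = Counter()
    for sample in local_samples:
        total.update(Counter(sample))
    return total


def frequency_of_frequencies(value_counts):
    """Return {j: f_j}, where f_j is the number of values seen exactly j times."""
    return dict(Counter(value_counts.values()))


def gee_estimate(fof, population_size, sample_size):
    """GEE: sqrt(N / n) * f_1 + sum of f_j for j >= 2 (illustrative choice)."""
    f1 = fof.get(1, 0)
    seen_more_than_once = sum(c for j, c in fof.items() if j >= 2)
    return sqrt(population_size / sample_size) * f1 + seen_more_than_once


# Hypothetical samples held by three machines.
local_samples = [["a", "b", "a"], ["b", "c"], ["d"]]

global_fof = frequency_of_frequencies(merge_local_counts(local_samples))

# Summing per-machine profiles gives a different (wrong) answer, because
# the value "b" is sampled on two machines.
summed_local_fof = Counter()
for sample in local_samples:
    summed_local_fof.update(frequency_of_frequencies(Counter(sample)))

estimate = gee_estimate(global_fof, population_size=600, sample_size=6)
```

In this toy input the global profile is f_1 = 2 and f_2 = 2, while adding up the per-machine profiles would give f_1 = 4 and f_2 = 1. That mismatch is why local profiles cannot simply be summed and why a naive approach has to move the sampled values themselves.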
Pages: 893 - 903
Page count: 11
Related papers
50 in total
  • [21] Sampling-based estimation method for parameter estimation in big data business era
    Alim, Abdul
    Shukla, Diwakar
    JOURNAL OF ADVANCES IN MANAGEMENT RESEARCH, 2021, 18 (02) : 297 - 322
  • [22] Distributed Gibbs: A Linear-Space Sampling-Based DCOP Algorithm
    Duc Thien Nguyen
    Yeoh, William
    Lau, Hoong Chuin
    Zivan, Roie
    JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2019, 64 : 705 - 748
  • [23] Sampling-Based Caching for Low Latency in Distributed Coded Storage Systems
    Liu, Kaiyang
    Wang, Jingrong
    Li, Heng
    Peng, Jun
    Pan, Jianping
    IEEE TRANSACTIONS ON SERVICES COMPUTING, 2023, 16 (06) : 4275 - 4287
  • [24] A Sampling-Based Distributed Exploration Method for UAV Cluster in Unknown Environments
    Wang, Yue
    Li, Xinpeng
    Zhuang, Xing
    Li, Fanyu
    Liang, Yutao
    DRONES, 2023, 7 (04)
  • [25] Adaptive sampling-based profiling techniques for optimizing the distributed JVM runtime
    Department of Computer Science, University of Hong Kong, Hong Kong, Hong Kong
    Proc. IEEE Int. Symp. Parallel Distrib. Process., IPDPS,
  • [26] Distributed subdata selection for big data via sampling-based approach
    Zhang, Haixiang
    Wang, HaiYing
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2021, 153
  • [27] Sampling-based Collision Warning System with Smartphone in Cloud Computing Environment
    Tak, S.
    Woo, S.
    Yeo, H.
    2015 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 2015, : 1181 - 1186
  • [28] A Compressive Sampling-Based Channel Estimation Method for Network Visibility Instrumentation
    De Vito, Luca
    Picariello, Francesco
    Rapuano, Sergio
    Tudosa, Ioan
    Barford, Lee
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2020, 69 (05) : 2335 - 2344
  • [29] Visual analytics system for LOD using sampling-based structure estimation
    Takama Y.
    Yabe A.
    Ishikawa H.
    Transactions of the Japanese Society for Artificial Intelligence, 2017, 32 (01) : WII - B_1
  • [30] Importance sampling-based estimation over AND/OR search spaces for graphical models
    Gogate, Vibhav
    Dechter, Rina
    ARTIFICIAL INTELLIGENCE, 2012, 184 : 38 - 77