Erasable Virtual HyperLogLog for Approximating Cumulative Distribution over Data Streams

Cited by: 1
Authors
Jia, Peng [1]
Wang, Pinghui [1, 2]
Zhao, Junzhou [1]
Tao, Jing [1]
Yuan, Ye [3]
Guan, Xiaohong [1]
Affiliations
[1] Xi An Jiao Tong Univ, MOE Key Lab Intelligent Networks & Network Secur, Xian 710049, Shaanxi, Peoples R China
[2] Xi An Jiao Tong Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Beijing Inst Technol, Beijing 100811, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Registers; Databases; Estimation; Memory management; Data privacy; Arrays; Sampling methods; Erasable virtual HyperLogLog; data distribution estimation; data streams; CARDINALITY ESTIMATION; EFFICIENT; TIME;
DOI
10.1109/TKDE.2021.3052938
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many real-world datasets arrive as streams of entity-identifier pairs, and measuring the data distribution of these datasets is fundamental for applications such as privacy protection. In this paper, we study the problem of computing the cumulative distribution of identifier cardinalities, where the cardinality of an identifier is the number of distinct entities owning that identifier. Previous sketch-based methods consume a large amount of memory, especially when there are many identifiers, and sampling-based methods require much time for cardinality estimation. A recent method, KHyperLogLog, combines sketching and sampling, but building a separate, large HyperLogLog sketch for identifiers with small cardinalities is wasteful. To address these challenges, we propose a memory-efficient method, EV-HLL, which uses a shared structure to store all sampled identifiers and their entities and uses additional sketches to track value updates during the sampling procedure. Meanwhile, EV-HLL provides real-time unbiased estimates that follow value changes whenever a new entity-identifier pair arrives. We evaluate EV-HLL and other state-of-the-art methods on available real-world datasets. Experimental results demonstrate that, compared with other methods, EV-HLL reduces memory usage at the same estimation accuracy and achieves higher accuracy at the same memory usage.
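To make the quantity studied in the abstract concrete, the following is a minimal sketch of an exact (non-streaming, memory-hungry) baseline, not the EV-HLL method itself. It assumes that the "cumulative distribution" means, for each cardinality c, the fraction of identifiers owned by at most c distinct entities. The function name `cardinality_cdf` and the sample stream are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cardinality_cdf(pairs):
    """Exact cumulative distribution of identifier cardinalities.

    pairs: iterable of (entity, identifier) tuples; duplicates are allowed.
    Returns a dict mapping each observed cardinality c to the fraction of
    identifiers whose cardinality is <= c.
    """
    # Store every distinct entity per identifier. This is the memory cost
    # that sketch- and sampling-based methods aim to avoid.
    entities_per_identifier = defaultdict(set)
    for entity, identifier in pairs:
        entities_per_identifier[identifier].add(entity)

    cardinalities = sorted(len(s) for s in entities_per_identifier.values())
    total = len(cardinalities)
    cdf = {}
    for rank, c in enumerate(cardinalities, start=1):
        # Later ranks overwrite earlier ones, so cdf[c] ends up as the
        # share of identifiers with cardinality <= c.
        cdf[c] = rank / total
    return cdf


# Hypothetical stream: identifier "a" has 2 distinct entities, "b" has 1, "c" has 3.
stream = [
    ("u1", "a"), ("u2", "a"), ("u1", "a"),
    ("u3", "b"),
    ("u1", "c"), ("u2", "c"), ("u3", "c"),
]
cdf = cardinality_cdf(stream)
```

Because this baseline keeps a full set of entities for every identifier, its memory grows with the whole stream, which is the cost the abstract attributes to exact or per-identifier approaches.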
Pages: 5336-5350
Number of pages: 15