Erasable Virtual HyperLogLog for Approximating Cumulative Distribution over Data Streams

Cited by: 1

Authors:

Jia, Peng [1]

Wang, Pinghui [1, 2]

Zhao, Junzhou [1]

Tao, Jing [1]

Yuan, Ye [3]

Guan, Xiaohong [1]

Affiliations:

[1] Xi An Jiao Tong Univ, MOE Key Lab Intelligent Networks & Network Secur, Xian 710049, Shaanxi, Peoples R China

[2] Xi An Jiao Tong Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China

[3] Beijing Inst Technol, Beijing 100811, Peoples R China

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2022 / Vol. 34 / No. 11

Funding:

National Natural Science Foundation of China;

Keywords:

Registers; Databases; Estimation; Memory management; Data privacy; Arrays; Sampling methods; Erasable virtual HyperLogLog; data distribution estimation; data streams; CARDINALITY ESTIMATION; EFFICIENT; TIME;

DOI:

10.1109/TKDE.2021.3052938

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many real-world datasets arrive as streams of entity-identifier pairs, and measuring the data distribution of such datasets is fundamental for applications such as privacy protection. In this paper, we study the problem of computing the cumulative distribution over cardinalities (i.e., the number of distinct entities owning the same identifier). Previous sketch-based methods consume a large amount of memory, especially when there are many identifiers, and sampling-based methods require considerable time for cardinality estimation. A recent work, KHyperLogLog, combines sketching and sampling, but it wastefully builds a separate, large HyperLogLog sketch even for identifiers with small cardinalities. To address these challenges, we propose EV-HLL, a memory-efficient method that uses a shared structure to store all sampled identifiers and their entities and uses additional sketches to track value updates during the sampling procedure. In addition, EV-HLL provides real-time unbiased estimates that reflect value changes whenever a new entity-identifier pair arrives. We evaluate EV-HLL and other state-of-the-art methods on available real-world datasets. Experimental results demonstrate that, compared with other methods, EV-HLL uses less memory at the same estimation accuracy and achieves higher accuracy at the same memory usage.
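To make the target quantity concrete, the following is a minimal exact baseline, not EV-HLL and not any method from the paper. It assumes the cumulative distribution means, for each cardinality k, the fraction of identifiers owned by at most k distinct entities; that reading, the function name `cardinality_cdf`, and the sample stream are all illustrative assumptions. It stores every entity set in full, which is the memory cost that sketch- and sampling-based methods aim to avoid.

```python
from collections import defaultdict


def cardinality_cdf(pairs):
    """Exact CDF of identifier cardinalities (hypothetical helper).

    pairs: iterable of (entity, identifier) tuples.
    Returns {k: fraction of identifiers with at most k distinct entities}.
    """
    # Track the distinct entities seen for each identifier.
    entities_by_identifier = defaultdict(set)
    for entity, identifier in pairs:
        entities_by_identifier[identifier].add(entity)

    # Count how many identifiers have each cardinality.
    identifiers_per_cardinality = defaultdict(int)
    for entities in entities_by_identifier.values():
        identifiers_per_cardinality[len(entities)] += 1

    # Accumulate the counts in increasing order of cardinality.
    total = len(entities_by_identifier)
    cdf = {}
    running = 0
    for cardinality in sorted(identifiers_per_cardinality):
        running += identifiers_per_cardinality[cardinality]
        cdf[cardinality] = running / total
    return cdf


# Made-up stream: identifier "a" has 2 distinct entities, "b" has 1, "c" has 3.
stream = [
    ("u1", "a"), ("u2", "a"), ("u1", "a"),
    ("u3", "b"),
    ("u4", "c"), ("u5", "c"), ("u6", "c"),
]
cdf = cardinality_cdf(stream)
```

Repeated pairs such as the second `("u1", "a")` do not change the result, because only distinct entities count toward an identifier's cardinality.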

Citation

Pages: 5336-5350

Page count: 15
