Erasable Virtual HyperLogLog for Approximating Cumulative Distribution over Data Streams

Cited by: 1
Authors
Jia, Peng [1]
Wang, Pinghui [1, 2]
Zhao, Junzhou [1]
Tao, Jing [1]
Yuan, Ye [3]
Guan, Xiaohong [1]
Affiliations
[1] Xi An Jiao Tong Univ, MOE Key Lab Intelligent Networks & Network Secur, Xian 710049, Shaanxi, Peoples R China
[2] Xi An Jiao Tong Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Beijing Inst Technol, Beijing 100811, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Registers; Databases; Estimation; Memory management; Data privacy; Arrays; Sampling methods; Erasable virtual HyperLogLog; data distribution estimation; data streams; CARDINALITY ESTIMATION; EFFICIENT; TIME;
DOI
10.1109/TKDE.2021.3052938
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Many real-world datasets arrive as streams of entity-identifier pairs, and measuring the data distribution of such datasets is fundamental for applications such as privacy protection. In this paper, we study the problem of computing the cumulative distribution of identifier cardinalities, where the cardinality of an identifier is the number of distinct entities owning it. Previous sketch-based methods consume a large amount of memory, especially when there are many identifiers, and sampling-based methods require much time for cardinality estimation. A recent work, KHyperLogLog, combines sketching and sampling, but separately building a large HyperLogLog sketch for identifiers with small cardinalities is wasteful. To address these challenges, we propose EV-HLL, a memory-efficient method that uses a shared structure to store all sampled identifiers and their entities and uses additional sketches to track value updates during the sampling procedure. EV-HLL also provides real-time unbiased estimates, updated according to value changes whenever a new entity-identifier pair arrives. We evaluate EV-HLL and other state-of-the-art methods on available real-world datasets. Experimental results demonstrate that, compared with other methods, EV-HLL uses less memory at the same estimation accuracy and achieves higher accuracy at the same memory usage.
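To make the estimated quantity concrete, the following is a minimal Python sketch of an exact baseline, not EV-HLL itself: it keeps one set of entities per identifier and derives the cumulative distribution of identifier cardinalities from those sets. The function name `cardinality_cdf` and the sample stream are hypothetical and chosen only for illustration. Because the memory used by this baseline grows with the number of distinct pairs, it also hints at why the abstract turns to sketching and sampling.

```python
from collections import Counter, defaultdict


def cardinality_cdf(pairs):
    """Exact cumulative distribution of identifier cardinalities.

    pairs: iterable of (entity, identifier) tuples; duplicates are allowed.
    Returns a dict mapping each observed cardinality c to the fraction of
    identifiers owned by at most c distinct entities.
    """
    # One set of distinct entities per identifier (exact, memory-hungry).
    entities_by_identifier = defaultdict(set)
    for entity, identifier in pairs:
        entities_by_identifier[identifier].add(entity)

    # Histogram: cardinality -> number of identifiers with that cardinality.
    histogram = Counter(len(s) for s in entities_by_identifier.values())
    total = sum(histogram.values())

    # Accumulate the histogram in increasing order of cardinality.
    cdf, running = {}, 0
    for c in sorted(histogram):
        running += histogram[c]
        cdf[c] = running / total
    return cdf


# Hypothetical stream of (entity, identifier) pairs.
stream = [
    ("alice", "zip=10001"),
    ("bob", "zip=10001"),
    ("alice", "zip=10001"),  # repeated pair, counted once
    ("carol", "zip=94105"),
    ("dave", "zip=60601"),
    ("erin", "zip=60601"),
    ("frank", "zip=60601"),
]
print(cardinality_cdf(stream))
```

In a privacy setting, the value of the distribution at a small cardinality indicates what fraction of identifiers are shared by only a few entities and may therefore be re-identifying.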
Pages: 5336-5350
Page count: 15