A Probabilistic Sketch for Summarizing Cold Items of Data Streams

Cited by: 0
Authors
Liu, Yongqiang [1 ,2 ]
Xie, Xike [2 ,3 ]
Affiliations
[1] Univ Sci & Technol China, Dept Comp Sci & Technol, Hefei 230026, Peoples R China
[2] Univ Sci & Technol China, Suzhou Inst Adv Res, MIRACLE Ctr, Data Darkness Lab, Suzhou 215123, Peoples R China
[3] Univ Sci & Technol China, Dept Biomed Engn, Hefei 230026, Peoples R China
Keywords
Data streams; sketch; data structures; COMPACT INVERTIBLE SKETCH; FINDING FREQUENT; ARCHITECTURE; FRAMEWORK; ALGORITHM; ELEMENTS;
DOI
10.1109/TNET.2023.3316426
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812
Abstract
Conventional sketches for counting stream item frequencies use hash functions to map data items to a concise structure, e.g., a two-dimensional array, at the expense of overcounting due to hash collisions. Despite their popularity, handling cold (low-frequency) items remains challenging, especially when space is limited. Cold items can be misreported as hot (high-frequency) items as the errors from hash collisions accumulate, which degrades estimation accuracy. We find that a streaming item can be split into a set of compactly stored basic elements, which can be recomposed in a probabilistic manner to estimate the frequency of the item. Thus, we design a novel decomposition and recomposition framework, called the XY-sketch, which estimates the frequency of a stream item by estimating the probability of its basic elements appearing in the data stream. By improving the estimation accuracy of cold items, we show that advanced streaming queries, such as top-k queries and heavy change queries, can also benefit. Throughout, we conduct theoretical analysis and optimizations under space constraints. Experiments on real datasets examine the effectiveness of our proposals.
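The overcounting problem that the abstract attributes to conventional sketches can be illustrated with a minimal sketch of a Count-Min style counter array. This is an assumed baseline chosen for illustration, not the XY-sketch itself; the class name, the salted BLAKE2b hashing, and the tiny array dimensions are all hypothetical choices made to force collisions.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Conventional depth x width counter array (illustrative baseline only)."""

    def __init__(self, depth: int, width: int) -> None:
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One hash function per row, obtained by salting with the row number.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(item.encode(), salt=salt).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def update(self, item: str, count: int = 1) -> None:
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item: str) -> int:
        # The minimum over rows never undercounts, but collisions inflate it.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))


# A skewed stream: three hot items and 200 cold items that appear once each.
stream = ["hot-a"] * 500 + ["hot-b"] * 300 + ["hot-c"] * 200
stream += [f"cold-{i}" for i in range(200)]

truth = Counter(stream)
sketch = CountMinSketch(depth=2, width=8)  # deliberately small to force collisions
for item in stream:
    sketch.update(item)

overcounted = [x for x in truth if sketch.estimate(x) > truth[x]]
print(f"{len(overcounted)} of {len(truth)} items are overestimated")
```

With far more distinct items than counters, most cold items share every one of their counters with other items, so their estimates are inflated well above the true count of 1. This is the failure mode the abstract describes.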
Pages: 1287 - 1302
Number of pages: 16
Related papers
50 in total
  • [1] Finding frequent items of data streams based on hierarchical sketch
    Network Information Center, Beijing Institute of Technology, Beijing 100081, China
    [J]. Beijing Ligong Daxue Xuebao, 2006, 6 (512-516):
  • [2] Identification of Heavy Hitters for Network Data Streams with Probabilistic Sketch
    Zhou, Aiping
    Zhu, Huisheng
    Liu, Lijun
    Zhu, Chengang
    [J]. 2018 IEEE 3RD INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA ANALYSIS (ICCCBDA), 2018, : 451 - 456
  • [3] Continuously Tracking Core Items in Data Streams with Probabilistic Decays
    Zhao, Junzhou
    Wang, Pinghui
    Tao, Jing
    Zhang, Shuo
    Lui, John C. S.
    [J]. 2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020), 2020, : 769 - 780
  • [4] WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams
    Li, Jizhou
    Li, Zikun
    Xu, Yifei
    Jiang, Shiqi
    Yang, Tong
    Cui, Bin
    Dai, Yafei
    Zhang, Gong
    [J]. KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 1574 - 1584
  • [5] WavingSketch: an unbiased and generic sketch for finding top-k items in data streams
    Liu, Zirui
    Dong, Fenghao
    Liu, Chengwu
    Deng, Xiangwei
    Yang, Tong
    Zhao, Yikai
    Li, Jizhou
    Cui, Bin
    Zhang, Gong
    [J]. VLDB JOURNAL, 2024, 33 (05) : 1697 - 1722
  • [6] Scout Sketch+: Finding Both Promising and Damping Items Simultaneously in Data Streams
    Gao, Guoju
    Ma, Tianyu
    Huang, He
    Sun, Yu-E.
    Wang, Haibo
    Du, Yang
    Chen, Shigang
    [J]. IEEE/ACM Transactions on Networking, 2024, 32 (06) : 5491 - 5506
  • [7] Graph Stream Sketch: Summarizing Graph Streams With High Speed and Accuracy
    Gou, Xiangyang
    Zou, Lei
    Zhao, Chenxingyu
    Yang, Tong
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (06) : 5901 - 5914
  • [8] Summarizing and Mining Skewed Data Streams
    Cormode, Graham
    Muthukrishnan, S.
    [J]. PROCEEDINGS OF THE FIFTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2005, : 44 - 55
  • [9] SUMMARIZING DATA USING PROBABILISTIC ASSERTIONS
    PEARL, J
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 1977, 23 (04) : 459 - 465
  • [10] Summarizing distributed data streams for storage in data warehouses
    Chiky, Raja
    Hebrail, Georges
    [J]. DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS, 2008, 5182 : 65 - 74