Finding top-k elements in data streams

Cited by: 42
Authors
Homem, Nuno [1]
Carvalho, Joao Paulo [1]
Affiliations
[1] INESC ID, TULisbon Inst Super Tecn, P-1000029 Lisbon, Portugal
Keywords
Approximate algorithms; Top-k algorithms; Most frequent; Estimation; Data stream frequencies; Frequent itemsets
DOI
10.1016/j.ins.2010.08.024
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Identifying the most frequent elements in a data stream is a well-known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. Fast algorithms with a small memory footprint are paramount when the number of individuals is very large. In many situations such analysis needs to be performed and kept up to date in near real time. Fortunately, approximate answers are usually adequate when dealing with this problem. This paper presents a new algorithm that addresses this problem by merging the commonly used counter-based and sketch-based techniques for top-k identification. The algorithm provides the top-k list of elements, their frequency and an error estimate for each frequency value. It also provides strong guarantees on the error estimate, order of elements and inclusion of elements in the list depending on their real frequency. Additionally, the algorithm provides stochastic bounds on the error and expected error estimates. Telecommunications customer behavior and voice call data are used to present concrete results obtained with this algorithm and to illustrate improvements over previously existing algorithms. (C) 2010 Elsevier Inc. All rights reserved.
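To make the counter-based idea mentioned in the abstract concrete, the following is a minimal sketch of the classic Space-Saving scheme, which keeps a fixed number of counters and records a maximum overestimation for each. It is not the paper's algorithm: the sketch-based part and the paper's guarantees are not reproduced here, and the names `space_saving`, `top_k` and the parameter `m` are illustrative assumptions.

```python
def space_saving(stream, m):
    """Illustrative Space-Saving summary with at most m counters.

    Returns {item: (estimated_count, max_overestimation)}.
    Assumption: this shows only a generic counter-based technique,
    not the algorithm proposed in the paper.
    """
    counts = {}  # item -> estimated count
    errors = {}  # item -> upper bound on how much the count overestimates
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < m:
            counts[x] = 1
            errors[x] = 0
        else:
            # Evict the item with the smallest count; the newcomer
            # inherits that count as its possible overestimation.
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[x] = floor + 1
            errors[x] = floor
    return {x: (counts[x], errors[x]) for x in counts}


def top_k(summary, k):
    """Return the k entries with the largest estimated counts."""
    return sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True)[:k]


summary = space_saving(list("aaaabbbcd"), m=3)
print(top_k(summary, 2))
```

For every monitored item, the true frequency lies between `estimated_count - max_overestimation` and `estimated_count`, which is the kind of per-element error estimate the abstract refers to. The linear scan for the minimum keeps the sketch short; a practical implementation would use a structure that finds the minimum faster.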
Pages: 4958 - 4974
Number of pages: 17
Related papers
50 in total
  • [31] Mining top-k frequent patterns over data streams sliding window
    Chen, Hui
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2014, 42 (01) : 111 - 131
  • [32] Finding top-k longest palindromes in substrings
    Mitani, Kazuki
    Mieno, Takuya
    Seto, Kazuhisa
    Horiyama, Takashi
    THEORETICAL COMPUTER SCIENCE, 2023, 979
  • [33] Finding skyline and top-k bargaining solutions
    Soliman, Mohamed A.
    Ilyas, Ihab F.
    Koudas, Nick
    2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2007, : 1238 - +
  • [34] Finding Top-k Optimal Sequenced Routes
    Liu, Huiping
    Jin, Cheqing
    Yang, Bin
    Zhou, Aoying
    2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2018, : 569 - 580
  • [35] Effective and efficient top-k query processing over incomplete data streams
    Ren, Weilong
    Lian, Xiang
    Ghazinour, Kambiz
    INFORMATION SCIENCES, 2021, 544 : 343 - 371
  • [36] LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams
    Martinez, Conrado
    Solera-Pardo, Gonzalo
    COMPUTING AND COMBINATORICS, COCOON 2022, 2022, 13595 : 24 - 35
  • [37] Continuously monitoring top-k uncertain data streams: a probabilistic threshold method
    Hua, Ming
    Pei, Jian
    DISTRIBUTED AND PARALLEL DATABASES, 2009, 26 (01) : 29 - 65
  • [38] Continuous Monitoring of Top-k Dominating Queries over Uncertain Data Streams
    Li, Guohui
    Luo, Changyin
    Li, Jianjun
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2014, PT I, 2014, 8786 : 244 - 255
  • [39] Using Bloom Filters for Mining Top-k Frequent Itemsets in Data Streams
    Kim, Younghee
    Cho, Kyungsoo
    Yoon, Jaeyeol
    Kim, Ieejoon
    Kim, Ungmo
    SECURE AND TRUST COMPUTING, DATA MANAGEMENT, AND APPLICATIONS, 2011, 186 : 209 - 216