Mining discriminative items in multiple data streams

Cited by: 6
Authors
Lin, Zhenhua [1]
Jiang, Bin [1]
Pei, Jian [1]
Jiang, Daxin [2]
Affiliations
[1] Simon Fraser University, Burnaby, BC V5A 1S6, Canada
[2] Microsoft Research Asia, Beijing, People's Republic of China
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
data mining; data streams; discriminative items; FINDING FREQUENT;
DOI
10.1007/s11280-010-0094-0
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
How can we maintain a dynamic profile capturing a user's reading interest against the common interest? What are the queries that have been asked 1,000 times more frequently to a search engine from users in Asia than in North America? What are the keywords (or tags) that are 1,000 times more frequent in the blog stream on computer games than in the blog stream on Hollywood movies? To answer such interesting questions, we need to find discriminative items in multiple data streams. Each data source, such as Web search queries in a region and blog postings on a topic, can be modeled as a data stream due to the fast growing volume of the source. Motivated by the extensive applications, in this paper, we study the problem of mining discriminative items in multiple data streams. We show that, to exactly find all discriminative items in stream S_1 against stream S_2 by one scan, the space lower bound is Ω(|Σ| log(n_1/|Σ|)), where Σ is the alphabet of items and n_1 is the current size of S_1. To tackle the space challenge, we develop three heuristic algorithms that can achieve high precision and recall using sub-linear space and sub-linear processing time per item with respect to |Σ|. The complexities of all algorithms are independent of the size of the two streams. An extensive empirical study using both real data sets and synthetic data sets verifies our design.
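To make the problem statement concrete, the following is a minimal sketch of an exact baseline, not one of the paper's three heuristic algorithms. It assumes that an item is discriminative in stream 1 against stream 2 when its relative frequency in stream 1 is at least theta times its relative frequency in stream 2; the function name, the parameter theta, and the treatment of items absent from stream 2 are assumptions made for illustration only. The baseline keeps one counter per distinct item, which is exactly the linear-in-alphabet space cost that motivates the sub-linear heuristics described in the abstract.

```python
from collections import Counter


def discriminative_items(stream1, stream2, theta):
    """Exact baseline (hypothetical helper, not the paper's algorithm).

    Returns the items whose relative frequency in stream1 is at least
    theta times their relative frequency in stream2.
    """
    counts1 = Counter(stream1)
    counts2 = Counter(stream2)
    n1 = sum(counts1.values())
    n2 = sum(counts2.values())

    result = set()
    for item, k1 in counts1.items():
        k2 = counts2.get(item, 0)
        # Test k1/n1 >= theta * k2/n2 by cross-multiplication, which avoids
        # division by zero. Assumption: an item that never occurs in stream2
        # is reported as discriminative.
        if k1 * n2 >= theta * k2 * n1:
            result.add(item)
    return result


queries_asia = ["a"] * 10 + ["b"] * 10
queries_north_america = ["a"] * 1 + ["b"] * 19
print(discriminative_items(queries_asia, queries_north_america, theta=5))
```

In this toy input, "a" is ten times more frequent in the first stream than in the second, so it passes the threshold of 5, while "b" does not.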
Pages: 497-522
Number of pages: 26