On efficiently summarizing categorical databases

Cited by: 29
Authors
Wang, JY
Karypis, G [1]
Affiliations
[1] Univ Minnesota, Digital Technol Ctr, Dept Comp Sci, Minneapolis, MN 55455 USA
[2] Univ Minnesota, Army HPC Res Ctr, Minneapolis, MN 55455 USA
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Keywords
data mining; frequent itemset; categorical database; clustering
DOI
10.1007/s10115-005-0216-7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Frequent itemset mining was initially proposed, and has been studied extensively, in the context of association rule mining. In recent years, several studies have also extended its application to transaction and document clustering. However, most frequent-itemset-based clustering algorithms must first mine a large intermediate set of frequent itemsets in order to identify the subset of most promising ones that can be used for clustering. In this paper, we study how to directly find a subset of high-quality frequent itemsets that can serve both as a concise summary of the transaction database and as a basis for clustering the categorical data. By exploring key properties of the subset of itemsets we are interested in, we propose several search space pruning methods and design an efficient algorithm called SUMMARY. Our empirical results show that SUMMARY runs very fast even when the minimum support is extremely low, and that it scales very well with respect to the database size. Surprisingly, as a pure frequent itemset mining algorithm, it is also very effective in clustering categorical data and summarizing dense transaction databases.
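For readers unfamiliar with the terms "frequent itemset" and "minimum support" used in the abstract, the following is a minimal illustrative sketch only. It is not the SUMMARY algorithm and uses none of its pruning methods; it is a brute-force enumeration over a hypothetical toy database, and the function name `frequent_itemsets` and the data are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset contained in at least
    min_support transactions. Brute force; suitable for toy data only."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            # Support = number of transactions that contain the candidate.
            support = sum(1 for t in transactions if cset <= t)
            if support >= min_support:
                result[cset] = support
                found_any = True
        if not found_any:
            # If no itemset of this size is frequent, no larger one can be.
            break
    return result


# Hypothetical toy transaction database (not from the paper).
toy_db = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

for itemset, support in sorted(
    frequent_itemsets(toy_db, min_support=3).items(),
    key=lambda kv: (len(kv[0]), sorted(kv[0])),
):
    print(sorted(itemset), support)
```

Even on this toy database the number of frequent itemsets grows quickly as the minimum support is lowered, which is the intermediate-set problem the abstract describes for clustering algorithms that mine all frequent itemsets first.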
Pages: 19-37
Number of pages: 19