An MDL framework for data clustering

Cited by: 0
Authors
Kontkanen, P [1 ]
Myllymäki, P [1 ]
Buntine, W [1 ]
Rissanen, J [1 ]
Tirri, H [1 ]
Affiliations
[1] Aalto Univ, CoSCo, FIN-02015 Helsinki, Finland
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We regard clustering as a data assignment problem where the goal is to partition the data into several nonhierarchical groups of items. To solve this problem, we suggest an information-theoretic framework based on the minimum description length (MDL) principle. Intuitively, the idea is to group together those data items that can be compressed well together, so that the total code length over all the data groups is optimized. One can argue that, since efficient compression is possible only when one has discovered underlying regularities that are common to all the members of a group, this approach produces an implicitly defined similarity metric between the data items. Formally, the global code length criterion to be optimized is defined by using the intuitively appealing universal normalized maximum likelihood code, which has been shown to produce an optimal compression rate in an explicitly defined manner. The number of groups can be assumed to be unknown, and the problem of deciding the optimal number is formalized as part of the same theoretical framework. In the empirical part of the paper, we present results that demonstrate the validity of the suggested clustering framework.
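To make the idea in the abstract concrete, the following is a minimal Python sketch of scoring a candidate grouping by its code length. It is an illustration under stated assumptions, not the criterion defined in the paper: it assumes a single categorical feature, encodes the cluster labels and then each cluster's values with a multinomial normalized maximum likelihood (NML) code, and computes the NML normalizer directly from its definition, which is only practical for small samples. All function names (`nml_complexity`, `nml_code_length`, `clustering_code_length`) and the toy data are hypothetical.

```python
import math
from collections import Counter
from functools import lru_cache


@lru_cache(maxsize=None)
def _unnormalized_sum(k, m):
    """Sum over all count vectors (h_1..h_k) with total m of
    multinomial(m; h) * prod(h_j ** h_j), using exact integers."""
    if k == 0:
        return 1 if m == 0 else 0
    # Pick r items for the first category, recurse on the rest.
    # Python evaluates 0 ** 0 as 1, matching the convention needed here.
    return sum(
        math.comb(m, r) * r**r * _unnormalized_sum(k - 1, m - r)
        for r in range(m + 1)
    )


def nml_complexity(k, n):
    """Multinomial NML normalizer for k categories and n observations."""
    if n == 0:
        return 1.0
    return _unnormalized_sum(k, n) / n**n


def nml_code_length(values, k):
    """NML code length in bits of a categorical sequence with k categories."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    # Maximized log-likelihood of the sequence under a multinomial model.
    log_ml = sum(h * math.log2(h / n) for h in counts.values())
    return -log_ml + math.log2(nml_complexity(k, n))


def clustering_code_length(data, labels, n_clusters, n_values):
    """Bits to encode the labels, plus bits to encode each cluster's data."""
    total = nml_code_length(labels, n_clusters)
    for c in range(n_clusters):
        members = [x for x, y in zip(data, labels) if y == c]
        total += nml_code_length(members, n_values)
    return total


# Toy data: one binary feature, eight items, two candidate groupings.
data = [0, 0, 0, 0, 1, 1, 1, 1]
homogeneous = [0, 0, 0, 0, 1, 1, 1, 1]  # each cluster holds one value
mixed = [0, 1, 0, 1, 0, 1, 0, 1]        # each cluster holds both values

len_homogeneous = clustering_code_length(data, homogeneous, 2, 2)
len_mixed = clustering_code_length(data, mixed, 2, 2)
print(f"homogeneous grouping: {len_homogeneous:.3f} bits")
print(f"mixed grouping:       {len_mixed:.3f} bits")
```

In this toy setting both groupings pay the same cost for the labels, so the difference comes entirely from how well each cluster's contents compress: the grouping whose clusters share a common value yields the shorter total code length and would be preferred under a code-length criterion of this kind.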
Pages: 323-353
Number of pages: 31