Mutual information, phi-squared and model-based co-clustering for contingency tables

Cited by: 36
Authors
Govaert, Gerard [1 ]
Nadif, Mohamed [2 ]
Affiliations
[1] UTC, UMR CNRS, Heudiasyc 7253, F-60205 Compiegne, France
[2] Univ Paris 05, LIPADE, F-75006 Paris, France
Keywords
Co-clustering; Biclustering; Contingency table; Information theory; 62-07; EM algorithm; Optimization; Sparse
DOI
10.1007/s11634-016-0274-6
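Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract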
Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.
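As a rough illustration of the two association measures named in the abstract, the sketch below computes phi-squared and mutual information for a small contingency table, then for the table obtained by summing its cells within blocks defined by a row partition and a column partition. The function names, the toy table and the partitions are hypothetical and do not come from the paper; the code assumes NumPy and a table with no empty row or column. It is only a sketch of how such criteria can be evaluated, not the authors' algorithm.

```python
import numpy as np


def _joint_and_expected(table):
    """Return the joint distribution and the product of its margins."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    rows = p.sum(axis=1, keepdims=True)
    cols = p.sum(axis=0, keepdims=True)
    return p, rows * cols


def phi_squared(table):
    """Phi-squared: the chi-squared statistic divided by the grand total."""
    p, expected = _joint_and_expected(table)
    return float(((p - expected) ** 2 / expected).sum())


def mutual_information(table):
    """Mutual information (natural log) between the row and column variables."""
    p, expected = _joint_and_expected(table)
    mask = p > 0  # cells with zero probability contribute nothing
    return float((p[mask] * np.log(p[mask] / expected[mask])).sum())


def aggregate(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the cells of each block defined by the two partitions."""
    t = np.asarray(table, dtype=float)
    r = np.eye(n_row_clusters)[row_labels]  # rows x row-clusters indicator
    c = np.eye(n_col_clusters)[col_labels]  # columns x column-clusters indicator
    return r.T @ t @ c


# Hypothetical 4 x 4 table with two visible diagonal blocks.
table = np.array([[5, 4, 0, 1],
                  [4, 6, 1, 0],
                  [0, 1, 7, 5],
                  [1, 0, 6, 8]])
row_labels = np.array([0, 0, 1, 1])
col_labels = np.array([0, 0, 1, 1])

blocks = aggregate(table, row_labels, col_labels, 2, 2)
print(blocks)
print(phi_squared(table), phi_squared(blocks))
print(mutual_information(table), mutual_information(blocks))
```

Merging rows and columns can never increase either measure, so a pair of partitions can be judged by how much of the original association the aggregated table retains.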
Pages: 455-488
Number of pages: 34