Mutual information, phi-squared and model-based co-clustering for contingency tables

Cited by: 36
Authors
Govaert, Gerard [1]
Nadif, Mohamed [2]
Affiliations
[1] UTC, UMR CNRS, Heudiasyc 7253, F-60205 Compiegne, France
[2] Univ Paris 05, LIPADE, F-75006 Paris, France
Keywords
Co-clustering; Biclustering; Contingency table; Information theory; 62-07; EM algorithm; Optimization; Sparse
DOI
10.1007/s11634-016-0274-6
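Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract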
Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.
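The abstract names phi-squared and mutual information as the measures of association behind the first framework but gives no formulas. As a minimal sketch, assuming the standard textbook definitions of both measures, the Python code below computes them for a contingency table and for the smaller table obtained by summing the cells within each block of a given co-clustering. The function names, the toy table and the cluster labels are hypothetical and are not taken from the paper; the code also assumes every row and column has a non-zero total.

```python
import numpy as np

def association_measures(table):
    """Return (phi_squared, mutual_information) of a two-way contingency table.

    Assumes every row and every column has a non-zero total.
    """
    counts = np.asarray(table, dtype=float)
    p = counts / counts.sum()            # joint relative frequencies p_ij
    row = p.sum(axis=1, keepdims=True)   # row margins p_i.
    col = p.sum(axis=0, keepdims=True)   # column margins p_.j
    indep = row * col                    # frequencies expected under independence
    phi2 = ((p - indep) ** 2 / indep).sum()
    nz = p > 0                           # convention: 0 * log(0) = 0
    mi = (p[nz] * np.log(p[nz] / indep[nz])).sum()
    return phi2, mi

def aggregate(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the cells of `table` inside each (row cluster, column cluster) block."""
    counts = np.asarray(table, dtype=float)
    r = np.eye(n_row_clusters)[row_labels]   # one-hot row-cluster indicators
    c = np.eye(n_col_clusters)[col_labels]   # one-hot column-cluster indicators
    return r.T @ counts @ c

# Hypothetical 4 x 4 document-term table with two obvious blocks.
table = [[10, 8, 0, 1],
         [9, 12, 1, 0],
         [0, 1, 11, 9],
         [1, 0, 8, 10]]
row_labels = [0, 0, 1, 1]
col_labels = [0, 0, 1, 1]

block_table = aggregate(table, row_labels, col_labels, 2, 2)
phi2_full, mi_full = association_measures(table)
phi2_block, mi_block = association_measures(block_table)
print(block_table)
print(phi2_full, phi2_block)
print(mi_full, mi_block)
```

Summing cells into blocks can only keep or lower either measure, so a co-clustering that preserves most of the original phi-squared or mutual information has kept most of the row-column association. Criteria of this kind score a candidate pair of partitions by how much of that association survives the aggregation.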
Pages: 455-488
Number of pages: 34
Published in: Advances in Data Analysis and Classification, 2018, 12: 455-488