Mutual information, phi-squared and model-based co-clustering for contingency tables

Cited by: 36

Authors:

Govaert, Gerard [1]

Nadif, Mohamed [2]

Affiliations:

[1] UTC, UMR CNRS, Heudiasyc 7253, F-60205 Compiegne, France

[2] Univ Paris 05, LIPADE, F-75006 Paris, France

Source:

ADVANCES IN DATA ANALYSIS AND CLASSIFICATION | 2018 / Vol. 12 / Issue 03

Keywords:

Co-clustering; Biclustering; Contingency table; Information theory; 62-07; EM ALGORITHM; OPTIMIZATION; SPARSE;

DOI:

10.1007/s11634-016-0274-6

Chinese Library Classification:

O21 [Probability theory and mathematical statistics]; C8 [Statistics]

Discipline classification codes:

020208; 070103; 0714

Abstract:

Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.
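To make the first framework concrete, the following minimal sketch computes phi-squared and mutual information for a contingency table, then for the block table obtained by summing counts within row and column clusters. It is an illustration under stated assumptions, not the authors' algorithm: the function names `association_measures` and `aggregate`, the toy counts, and the fixed partitions are all hypothetical, and the sketch only evaluates the two measures for a given co-clustering rather than optimizing them.

```python
import numpy as np

def association_measures(table):
    """Return (phi_squared, mutual_information) of a contingency table.

    Assumes every row and column has a positive total, so that the
    independence term p_i. * p_.j is never zero.
    """
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                                   # joint relative frequencies
    expected = np.outer(p.sum(axis=1), p.sum(axis=0)) # product of the margins
    phi2 = ((p - expected) ** 2 / expected).sum()
    nz = p > 0                                        # 0 * log(0) is treated as 0
    mi = (p[nz] * np.log(p[nz] / expected[nz])).sum()
    return phi2, mi

def aggregate(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the counts inside each (row cluster, column cluster) block."""
    table = np.asarray(table, dtype=float)
    z = np.eye(n_row_clusters)[row_labels]            # one-hot row memberships
    w = np.eye(n_col_clusters)[col_labels]            # one-hot column memberships
    return z.T @ table @ w

# Hypothetical 4 x 4 table of counts with an obvious two-block structure.
counts = np.array([[8, 7, 0, 1],
                   [9, 6, 1, 0],
                   [0, 1, 7, 9],
                   [1, 0, 8, 6]])
rows = np.array([0, 0, 1, 1])   # assumed row partition
cols = np.array([0, 0, 1, 1])   # assumed column partition

block = aggregate(counts, rows, cols, 2, 2)
phi2_full, mi_full = association_measures(counts)
phi2_block, mi_block = association_measures(block)

print("phi-squared (full, block):", phi2_full, phi2_block)
print("mutual information (full, block):", mi_full, mi_block)
```

Aggregating cells can only lose association, so both measures on the block table are bounded above by their values on the original table. A co-clustering that keeps the block values close to the original ones preserves most of the row-column dependence.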

Citation

Pages: 455-488

Number of pages: 34
