Parameterized Complexity of Feature Selection for Categorical Data Clustering

Authors
Bandyapadhyay, Sayan [1]
Fomin, Fedor V. [2]
Golovach, Petr A. [2]
Simonov, Kirill [3]
Affiliations
[1] Portland State Univ, Comp Sci Dept, POB 751, Portland, OR 97207 USA
[2] Univ Bergen, Dept Informat, POB 7803, N-5020 Bergen, Norway
[3] Hasso Plattner Inst Digital Engn gGmbH, Postfach 900460, D-14440 Potsdam, Germany
Keywords
Robust clustering; PCA; Low-rank approximation; Hypergraph enumeration
DOI
10.1145/3604797
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $\ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $\Sigma$), we want to select $m - \ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($\ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|\Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k, B, |\Sigma|) \cdot m^{g(k, |\Sigma|)} \cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|\Sigma|$ and $k$ are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints.
One interesting fact about Constrained Clustering with Outliers is that besides Feature Selection, it encompasses many other fundamental problems regarding categorical data such as Robust Clustering and Binary and Boolean Low-rank Matrix Approximation with Outliers. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
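The cost model in the abstract can be made concrete with a small sketch. The Python code below is not the authors' fixed-parameter algorithm. It is an exponential-time brute force, usable only on toy inputs, that tries every set of features to keep and every assignment of points to clusters. All names (`cluster_cost`, `optimal_clustering_cost`, `feature_selection`, and the parameter `ell` for the number of irrelevant features) are illustrative assumptions. It relies on the standard fact that, under Hamming distance, a best center takes the most frequent value in each coordinate.

```python
from collections import Counter
from itertools import combinations, product


def cluster_cost(cluster, features):
    """Hamming cost of one cluster against its best center on `features`.

    A best center takes the most frequent value in each coordinate, so each
    coordinate contributes (cluster size - count of the most frequent value).
    """
    cost = 0
    for j in features:
        counts = Counter(point[j] for point in cluster)
        cost += len(cluster) - max(counts.values())
    return cost


def optimal_clustering_cost(points, features, k):
    """Minimum total cost over all assignments of points to at most k clusters.

    Brute force over k**n labelings; for illustration only.
    """
    best = None
    for labels in product(range(k), repeat=len(points)):
        clusters = [[] for _ in range(k)]
        for point, label in zip(points, labels):
            clusters[label].append(point)
        total = sum(cluster_cost(c, features) for c in clusters if c)
        if best is None or total < best:
            best = total
    return best


def feature_selection(points, ell, k, budget):
    """Return m - ell kept feature indices whose optimal cost is <= budget, else None."""
    m = len(points[0])
    for kept in combinations(range(m), m - ell):
        if optimal_clustering_cost(points, kept, k) <= budget:
            return kept
    return None


# Toy data (made up): the third feature is noise that inflates the cost.
points = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "y", "r"),
    ("b", "y", "s"),
]
print(feature_selection(points, ell=1, k=2, budget=0))  # prints (0, 1)
```

On this toy input, keeping all three features forces a cost of 2 with two clusters, while dropping the noisy third feature allows a zero-cost clustering.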
Pages: 24