Parameterized Complexity of Feature Selection for Categorical Data Clustering

Cited by: 0
Authors
Bandyapadhyay, Sayan [1 ]
Fomin, Fedor V. [2 ]
Golovach, Petr A. [2 ]
Simonov, Kirill [3 ]
Affiliations
[1] Portland State Univ, Comp Sci Dept, POB 751, Portland, OR 97207 USA
[2] Univ Bergen, Dept Informat, POB 7803, N-5020 Bergen, Norway
[3] Hasso Plattner Inst Digital Engn gGmbH, Postfach 900460, D-14440 Potsdam, Germany
Keywords
Robust clustering; PCA; Low-rank approximation; Hypergraph enumeration
DOI
10.1145/3604797
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Subject Classification Code
081202
Abstract
We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values S), we want to select m − ℓ relevant features such that the cost of any optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ0-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters k, B, and |S|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |S|) · m^{g(k, |S|)} · n^2 for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |S| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that, besides Feature Selection, it encompasses many other fundamental problems regarding categorical data, such as Robust Clustering and Binary and Boolean Low-rank Matrix Approximation with Outliers. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
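The cost function in the abstract can be made concrete with a short sketch. The Python code below (the names cluster_cost and has_cheap_clustering and the toy instance are illustrative assumptions, not taken from the paper) computes the Hamming-distance cost of a clustering restricted to a selected feature subset and brute-forces a tiny Feature Selection instance; the exhaustive search is exponential in m and n and only serves to spell out the definition, it is not the paper's parameterized algorithm.

    from itertools import combinations, product
    from collections import Counter

    def cluster_cost(cluster, selected):
        """Cost of one cluster on the selected features: the sum of Hamming
        distances from each point to the best possible center. The optimal
        center takes the most frequent value in every coordinate, so each
        coordinate contributes the number of points that disagree with it."""
        if not cluster:
            return 0
        cost = 0
        for j in selected:
            counts = Counter(point[j] for point in cluster)
            cost += len(cluster) - max(counts.values())
        return cost

    def has_cheap_clustering(points, k, ell, B):
        """Brute-force check of a Feature Selection instance: is there a set of
        m - ell features on which some k-clustering has cost at most B?
        Exponential in m and n; meant only to make the definition concrete."""
        n, m = len(points), len(points[0])
        for selected in combinations(range(m), m - ell):
            for assignment in product(range(k), repeat=n):
                clusters = [[p for p, a in zip(points, assignment) if a == c]
                            for c in range(k)]
                if sum(cluster_cost(c, selected) for c in clusters) <= B:
                    return True
        return False

    # Toy instance: n = 4 points, m = 3 binary features; dropping ell = 1 noisy
    # feature (the last coordinate) admits a 2-clustering of cost B = 0.
    points = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
    print(has_cheap_clustering(points, k=2, ell=1, B=0))  # True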
Pages: 24
Related Papers
50 items in total
  • [11] An evaluation of filter and wrapper methods for feature selection in categorical clustering
    Talavera, L
    ADVANCES IN INTELLIGENT DATA ANALYSIS VI, PROCEEDINGS, 2005, 3646 : 440 - 451
  • [12] Self-Expressive Kernel Subspace Clustering Algorithm for Categorical Data with Embedded Feature Selection
    Chen, Hui
    Xu, Kunpeng
    Chen, Lifei
    Jiang, Qingshan
    MATHEMATICS, 2021, 9 (14)
  • [13] A Mutual Information Based on Ant Colony Optimization Method to Feature Selection for Categorical Data Clustering
    Shojaee, Z.
    Fazeli, S. A. Shahzadeh
    Abbasi, E.
    Adibnia, F.
    Masuli, F.
    Rovetta, S.
    IRANIAN JOURNAL OF SCIENCE, 2023, 47 (01) : 175 - 186
  • [15] On the Parameterized Complexity of Consensus Clustering
    Doernfelder, Martin
    Guo, Jiong
    Komusiewicz, Christian
    Weller, Mathias
    ALGORITHMS AND COMPUTATION, 2011, 7074 : 624+
  • [16] On the parameterized complexity of consensus clustering
    Doernfelder, Martin
    Guo, Jiong
    Komusiewicz, Christian
    Weller, Mathias
    THEORETICAL COMPUTER SCIENCE, 2014, 542 : 71 - 82
  • [17] On the Parameterized Complexity of Clustering Incomplete Data into Subspaces of Small Rank
    Ganian, Robert
    Kanj, Iyad
    Ordyniak, Sebastian
    Szeider, Stefan
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 3906 - 3913
  • [18] Categorical Data Clustering with Automatic Selection of Cluster Number
    Liao, Hai-Yong
    Ng, Michael K.
    FUZZY INFORMATION AND ENGINEERING, 2009, 1 (01) : 5 - 25
  • [19] Data complexity measures in feature selection
    Okimoto, Lucas C.
    Lorena, Ana C.
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [20] Revisiting Feature Selection with Data Complexity
    Ngan Thi Dong
    Khosla, Megha
    2020 IEEE 20TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE 2020), 2020, : 211 - 216