Parameterized Complexity of Feature Selection for Categorical Data Clustering

Cited by: 0

Authors:

Bandyapadhyay, Sayan [1]

Fomin, Fedor V. [2]

Golovach, Petr A. [2]

Simonov, Kirill [3]

Affiliations:

[1] Portland State Univ, Comp Sci Dept, POB 751, Portland, OR 97207 USA

[2] Univ Bergen, Dept Informat, POB 7803, N-5020 Bergen, Norway

[3] Hasso Plattner Inst Digital Engn gGmbH, Postfach 900460, D-14440 Potsdam, Germany

Source:

ACM TRANSACTIONS ON COMPUTATION THEORY | 2023 / Vol. 15 / Issue 3-4

Keywords:

Robust clustering; PCA; Low-rank approximation; hypergraph enumeration;

DOI:

10.1145/3604797

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Subject classification code:

081202;

Abstract:

We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to reducing dimensionality in practice, most known feature selection methods are heuristics. We study the following mathematical model. We assume that some inadvertent (or undesirable) features of the input data unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features such that there is a small-cost clustering on the selected features. More precisely, for given integers $\ell$ (the number of irrelevant features) and $k$ (the number of clusters), a budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $S$), we want to select $m - \ell$ relevant features such that the cost of an optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($\ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on the parameters $k$, $B$, and $|S|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k, B, |S|) \cdot m^{g(k, |S|)} \cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|S|$ and $k$ are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points can be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that, besides Feature Selection, it encompasses many other fundamental problems regarding categorical data, such as Robust Clustering and Binary and Boolean Low-rank Matrix Approximation with Outliers. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
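
To make the problem statement in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the fixed-parameter algorithm of the paper; it simply enumerates every assignment of points to at most k clusters, which takes exponential time and is only usable on tiny inputs. The function names, the parameter name `ell` (the number of irrelevant features to drop), and the sample data are illustrative assumptions. The sketch relies on two observations: for a fixed cluster, the best center takes the most frequent value in each coordinate, and for a fixed partition the cost splits into a sum over features, so the best choice is to drop the `ell` most expensive features.

```python
from collections import Counter
from itertools import product


def per_feature_costs(points, labels, k):
    """Hamming cost contributed by each feature for a fixed partition.

    The optimal center of a cluster takes the most frequent value in each
    coordinate, so a feature costs (cluster size - highest value count).
    """
    m = len(points[0])
    costs = [0] * m
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # empty clusters cost nothing
        for j in range(m):
            counts = Counter(p[j] for p in members)
            costs[j] += len(members) - max(counts.values())
    return costs


def feature_selection_brute_force(points, ell, k, budget):
    """Return (feasible, best_cost) for a non-empty list of equal-length tuples.

    Hypothetical helper: tries all k**n labelings, so it is exponential in n.
    """
    n, m = len(points), len(points[0])
    best = None
    for labels in product(range(k), repeat=n):
        # Keep the m - ell cheapest features for this partition.
        costs = sorted(per_feature_costs(points, labels, k))
        total = sum(costs[: m - ell])
        if best is None or total < best:
            best = total
    return best <= budget, best


# Illustrative data: the third feature is noise with respect to the two groups.
sample = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
print(feature_selection_brute_force(sample, ell=1, k=2, budget=0))
print(feature_selection_brute_force(sample, ell=0, k=2, budget=1))
```

In the sample, dropping one feature allows a zero-cost 2-clustering, whereas keeping all three features forces a cost above the budget of 1.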


Pages: 24
