A Discriminative Framework for Clustering via Similarity Functions

Cited by: 0

Authors:

Balcan, Maria-Florina ^{[1]}

Blum, Avrim ^{[1]}

Vempala, Santosh ^{[2]}

Affiliations:

[1] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA

[2] Georgia Inst Technol, Coll Comp, Atlanta, GA USA

Source:

STOC'08: PROCEEDINGS OF THE 2008 ACM INTERNATIONAL SYMPOSIUM ON THEORY OF COMPUTING | 2008

Funding:

National Science Foundation (USA);

Keywords:

Clustering; Similarity Functions; Learning;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground-truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic; the ground truth is really the unknown correct clustering of the data points and the real goal is to achieve low error on the data. In this work, we develop a theoretical approach to clustering from this perspective. In particular, motivated by recent work in learning theory that asks "what natural properties of a similarity (or kernel) function are sufficient to be able to learn well?" we ask "what natural properties of a similarity function are sufficient to be able to cluster well?" To study this question we develop a theoretical framework that can be viewed as an analog of the PAC learning model for clustering, where the object of study, rather than being a concept class, is a class of (concept, similarity function) pairs, or equivalently, a property the similarity function should satisfy with respect to the ground truth clustering. We then analyze both algorithmic and information-theoretic issues in our model. While quite strong properties are needed if the goal is to produce a single approximately-correct clustering, we find that a number of reasonable properties are sufficient under two natural relaxations: (a) list clustering: analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (which a user can select from) and (b) hierarchical clustering: the algorithm's goal is to produce a hierarchy such that the desired clustering is some pruning of this tree (which a user could navigate).
We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory), that characterizes its information-theoretic usefulness for clustering. We analyze this quantity for several natural game-theoretic and learning-theoretic properties, as well as design new efficient algorithms that are able to take advantage of them. Our algorithms for hierarchical clustering combine recent learning-theoretic approaches with linkage-style methods. We also show how our algorithms can be extended to the inductive case, i.e., by using just a constant-sized sample, as in property testing. The analysis here uses regularity-type results of [20] and [3].
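The hierarchical relaxation described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it runs plain single linkage on a toy similarity matrix and checks whether a target clustering is a pruning of the resulting tree. The separation property used for the toy data (every within-cluster pair is more similar than any cross-cluster pair), the function names, and the data are all assumptions made for illustration.

```python
from itertools import combinations


def single_linkage_tree(sim):
    """Return every cluster (as a frozenset of indices) formed while merging.

    sim is a symmetric matrix of pairwise similarities (higher = more similar).
    """
    clusters = [frozenset([i]) for i in range(len(sim))]
    nodes = set(clusters)  # leaves are tree nodes too
    while len(clusters) > 1:
        # Single linkage: cluster similarity is the best pairwise similarity.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: max(sim[i][j] for i in pair[0] for j in pair[1]),
        )
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        nodes.add(merged)
    return nodes


def is_pruning(tree_nodes, target):
    """A partition is a pruning of the tree if each of its parts is a tree node."""
    return all(frozenset(part) in tree_nodes for part in target)


# Hypothetical ground-truth clustering of five points.
target = [{0, 1, 2}, {3, 4}]


def toy_similarity(i, j):
    # Assumed property: within-cluster pairs score about 0.9,
    # cross-cluster pairs score about 0.2.
    same = any(i in part and j in part for part in target)
    return 0.9 - 0.01 * abs(i - j) if same else 0.2 + 0.01 * abs(i - j)


sim = [[toy_similarity(i, j) for j in range(5)] for i in range(5)]
nodes = single_linkage_tree(sim)
found = is_pruning(nodes, target)
```

Under the assumed separation property, `found` is true because single linkage merges all within-cluster pairs before any cross-cluster pair. A user would still have to navigate the tree to pick out the pruning, which is the interaction the abstract alludes to.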


Pages: 671 / +

Page count: 2

Total: 50 items

[21] Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis
Lensen, Andrew
Xue, Bing
Zhang, Mengjie
EVOLUTIONARY COMPUTATION, 2020, 28 (04) : 531 - 561
[22] Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions
Mai, Te-Lun
Hu, Geng-Ming
Chen, Chi-Ming
JOURNAL OF PROTEOME RESEARCH, 2016, 15 (07) : 2123 - 2131
[23] Adaptive clustering federated learning via similarity acceleration
Zhu S.
Gu B.
Sun G.
Tongxin Xuebao/Journal on Communications, 45 (03) : 197 - 207
[24] Using clustering to learn distance functions for supervised similarity assessment
Eick, Christoph F.
Rouhana, Alain
Bagherjeiran, A.
Vilalta, R.
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2006, 19 (04) : 395 - 401
[25] SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering
Ribeiro, Leonardo Andrade
Cuzzocrea, Alfredo
Alves Bezerra, Karen Aline
do Nascimento, Ben Hur Bahia
PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOL 1 (ICEIS), 2016 : 75 - 80
[26] Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework
Ribeiro, Leonardo Andrade
Cuzzocrea, Alfredo
Alves Bezerra, Karen Aline
Bahia do Nascimento, Ben Hur
DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2016, PT I, 2016, 9827 : 185 - 204
[27] Genetic Programming for Evolving Similarity Functions Tailored to Clustering Algorithms
Andersen, Hayden
Lensen, Andrew
Xue, Bing
2021 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC 2021), 2021 : 688 - 695
[28] Using clustering to learn distance functions for supervised similarity assessment
Eick, CF
Rouhana, A
Bagherjeiran, A
Vilalta, R
MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION, PROCEEDINGS, 2005, 3587 : 120 - 131
[29] Microblog Friends Automatic Clustering Framework based on Similarity Measurement
Wang, Chenxu
Guan, Xiaohong
Qin, Tao
2014 11TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION (WCICA), 2014 : 5152 - 5157
[30] DISCRIMINATIVE EXEMPLAR CLUSTERING
Yang, Yingzhen
Liang, Feng
Huang, Thomas S.
2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014
