Sparse computation for large-scale data mining

Cited by: 0

Authors:

Hochbaum, Dorit S. [1]

Baumann, Philipp [1]

Affiliations:

[1] Univ Calif Berkeley, Etcheverry Hall, Berkeley, CA 94720 USA

Source:

2014 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2014

Funding:

National Science Foundation (USA);

Keywords:

NEAREST-NEIGHBOR; ALGORITHMS; PSEUDOFLOW;

DOI:

Not available

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Several leading data mining and clustering algorithms rely on inputs in the form of pairwise similarities. Yet, since the number of potential pairwise similarities grows quadratically in the size of the data set, it is computationally prohibitive to apply such algorithms to large data sets. This paper addresses this challenge with a novel method of sparse computation that computes only the relevant similarities instead of the complete similarity matrix. The method employs an efficient algorithm that provides an "approximate Principal Component Analysis". In the resulting low-dimensional space, the concept of grid neighborhoods is applied to identify groups of objects with potentially high similarity. Unlike known sparsification approaches, which first generate the full set of pairwise similarities and thus take at least quadratic time, the sparse computation method generates only the relevant similarities. Sparse computation can be utilized in any data mining or clustering algorithm that requires pairwise similarities, such as the k-nearest neighbors algorithm or the spectral method. This approach differs from grid-based clustering algorithms in that grid-neighborhood proximity is used only to determine the entries in the sparse similarity matrix, not to identify the clusters. Indeed, objects can belong to the same grid neighborhood while ending up in different clusters or, conversely, belong to different neighborhoods yet be clustered together. The applicability of sparse computation to binary classification is demonstrated here for the recently devised supervised normalized cut (SNC). Our empirical results show that the approach achieves a significant reduction in the density of the similarity matrix, resulting in a substantial reduction in running time, while having a minimal effect (and often none) on accuracy compared with inputs using a complete similarity matrix.
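The pipeline described in the abstract (project to a low-dimensional space, overlay a grid, compute similarities only within grid neighborhoods) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: an exact SVD-based PCA stands in for the paper's "approximate PCA", a neighborhood is taken to be a grid cell plus its adjacent cells, and the similarity is a Gaussian of the squared Euclidean distance in the original space. The function name `sparse_similarities` and the parameters `p` (projected dimensions), `k` (grid intervals per dimension), and `scale` are hypothetical.

```python
from collections import defaultdict
from itertools import product

import numpy as np


def sparse_similarities(X, p=2, k=4, scale=1.0):
    """Return {(i, j): similarity} only for pairs in the same or adjacent grid cells.

    Assumptions (not taken from the paper): exact PCA via SVD replaces the
    approximate PCA, and similarity is exp(-scale * squared distance).
    """
    X = np.asarray(X, dtype=float)

    # Step 1: project onto the top-p principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:p].T

    # Step 2: rescale each projected dimension to [0, 1] and assign every
    # object to one of k intervals per dimension (k**p grid cells overall).
    lo = Z.min(axis=0)
    span = Z.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against a constant dimension
    cells = np.minimum((k * (Z - lo) / span).astype(int), k - 1)

    blocks = defaultdict(list)
    for i, cell in enumerate(map(tuple, cells)):
        blocks[cell].append(i)

    # Step 3: compute similarities only between objects whose cells are
    # identical or adjacent; all other pairs are never evaluated.
    offsets = list(product((-1, 0, 1), repeat=p))
    sims = {}
    for cell, members in blocks.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            if neighbor not in blocks:
                continue
            for i in members:
                for j in blocks[neighbor]:
                    if i < j:  # store each unordered pair once
                        d2 = np.sum((X[i] - X[j]) ** 2)
                        sims[(i, j)] = float(np.exp(-scale * d2))
    return sims


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 10))
    pairs = sparse_similarities(data, p=2, k=5)
    total = 200 * 199 // 2
    print(f"computed {len(pairs)} of {total} possible pairs")
```

The returned dictionary can be loaded into any sparse matrix format and passed to an algorithm that expects pairwise similarities. A finer grid (larger `k`) yields a sparser matrix at the risk of omitting some relevant pairs.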


Pages: 354-363

Page count: 10


