Sparse computation for large-scale data mining

Cited by: 0
Authors
Hochbaum, Dorit S. [1 ]
Baumann, Philipp [1 ]
Affiliations
[1] Univ Calif Berkeley, Etcheverry Hall, Berkeley, CA 94720 USA
Funding
National Science Foundation (USA);
Keywords
NEAREST-NEIGHBOR; ALGORITHMS; PSEUDOFLOW;
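DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract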
Several leading data mining and clustering algorithms rely on inputs in the form of pairwise similarities. Yet because the number of potential pairwise similarities grows quadratically with the size of the data set, it is computationally prohibitive to apply such algorithms to large data sets. This paper addresses this challenge with a novel method of sparse computation that computes only the relevant similarities instead of the complete similarity matrix. The method employs an efficient algorithm that provides an "approximate Principal Component Analysis". In the resulting low-dimensional space, the concept of grid neighborhoods is applied in order to identify groups of objects with potentially high similarity. Unlike known sparsification approaches, which first generate the full set of pairwise similarities and thus take at least quadratic time, the sparse computation method generates only the relevant similarities. Sparse computation can be utilized in any data mining or clustering algorithm that requires pairwise similarities, such as the k-nearest neighbors algorithm or the spectral method. This approach differs from grid-based clustering algorithms in that grid-neighborhood proximity is used only to determine the entries in the sparse similarity matrix, not to identify the clusters. Indeed, objects can belong to the same grid neighborhood while ending up in different clusters or, conversely, belong to different neighborhoods yet be clustered together. The applicability of sparse computation to binary classification is demonstrated here for the recently devised supervised normalized cut (SNC). Our empirical results show that the approach achieves a significant reduction in the density of the similarity matrix, resulting in a substantial reduction in running time, while having a minimal effect (and often none) on accuracy as compared to inputs using a complete similarity matrix.
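The pipeline described in the abstract (project to a low-dimensional space, overlay a grid, compute similarities only within grid neighborhoods) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions: exact PCA via SVD stands in for the paper's "approximate PCA", a grid neighborhood is taken to mean the same or an adjacent grid block, and the similarity is a Gaussian kernel. The function name sparse_similarities and the parameters p, k, and eps are hypothetical.

```python
import itertools
from collections import defaultdict

import numpy as np


def sparse_similarities(X, p=2, k=4, eps=1.0):
    """Return {(i, j): similarity} for pairs in the same or adjacent grid blocks.

    X   : (n, d) data matrix
    p   : number of leading principal components kept (assumed parameter)
    k   : grid resolution, i.e. intervals per projected dimension (assumed)
    eps : Gaussian kernel scale (assumed similarity measure)
    """
    X = np.asarray(X, dtype=float)

    # Exact PCA via SVD; an assumption standing in for "approximate PCA".
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    p = min(p, vt.shape[0])
    Z = Xc @ vt[:p].T

    # Rescale each projected coordinate to [0, 1] and assign a grid block.
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum((k * (Z - lo) / span).astype(int), k - 1)

    # Group object indices by the grid block they fall into.
    blocks = defaultdict(list)
    for idx, cell in enumerate(map(tuple, cells)):
        blocks[cell].append(idx)

    # Compute similarities only between objects in the same or adjacent blocks.
    sims = {}
    offsets = list(itertools.product((-1, 0, 1), repeat=p))
    for cell, members in blocks.items():
        for off in offsets:
            other = tuple(c + o for c, o in zip(cell, off))
            if other not in blocks:
                continue
            for i in members:
                for j in blocks[other]:
                    if i < j:  # store each unordered pair once
                        d2 = np.sum((X[i] - X[j]) ** 2)
                        sims[(i, j)] = float(np.exp(-eps * d2))
    return sims
```

The returned dictionary holds only the entries of the sparse similarity matrix. Pairs of objects in distant grid blocks are never visited, which is what avoids building the full quadratic set of similarities. The result could then be passed to any method that accepts pairwise similarities; the grid itself plays no role in deciding the clusters.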
Pages: 354-363
Page count: 10
Related papers
50 in total
  • [1] Sparse-Reduced Computation for Large-Scale Spectral Clustering
    Baumann, Philipp
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND ENGINEERING MANAGEMENT (IEEM), 2016, : 1284 - 1288
  • [2] Hierarchical visual data mining for large-scale data
    Matthew Ward
    Wei Peng
    Xiaoning Wang
    [J]. Computational Statistics, 2004, 19 : 147 - 158
  • [3] Hierarchical visual data mining for large-scale data
    Ward, M
    Peng, W
    Wang, XN
    [J]. COMPUTATIONAL STATISTICS, 2004, 19 (01) : 147 - 158
  • [4] Intelligent approach for large-scale data mining
    Fouad, Khaled M.
    El-Bably, Doaa L.
    [J]. INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS IN TECHNOLOGY, 2020, 63 (1-2) : 93 - 113
  • [5] Entity Relation Mining in Large-Scale Data
    Li, Jingnan
    Cai, Yi
    Wang, Qixuan
    Hu, Shuyue
    Wang, Tao
    Min, Huaqing
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2015, 2015, 9052 : 109 - 121
  • [6] Very Sparse LSSVM Reductions for Large-Scale Data
    Mall, Raghvendra
    Suykens, Johan A. K.
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2015, 26 (05) : 1086 - 1097
  • [7] A Survey of Approximate Quantile Computation on Large-Scale Data
    Chen, Zhiwei
    Zhang, Aoqian
    [J]. IEEE ACCESS, 2020, 8 : 34585 - 34597
  • [8] The integrated delivery of large-scale data mining: The ACSys Data Mining Project
    Williams, G
    Altas, I
    Bakin, S
    Christen, P
    Hegland, M
    Marquez, A
    Milne, P
    Nagappan, R
    Roberts, S
    [J]. LARGE-SCALE PARALLEL DATA MINING, 2000, 1759 : 24 - 54
  • [9] Takeaways in Large-scale Human Mobility Data Mining
    Chen, Guangshuo
    Viana, Aline Carneiro
    Fiore, Marco
    [J]. 2018 IEEE INTERNATIONAL SYMPOSIUM ON LOCAL AND METROPOLITAN AREA NETWORKS (LANMAN), 2018, : 55 - 60
  • [10] Mining large-scale smartphone data for personality studies
    Gokul Chittaranjan
    Jan Blom
    Daniel Gatica-Perez
    [J]. Personal and Ubiquitous Computing, 2013, 17 : 433 - 450