High-Performance Geometric Algorithms for Sparse Computation in Big Data Analytics

Authors
Baumann, Philipp [1]
Hochbaum, Dorit S. [2]
Spaen, Quico [2]
Affiliations
[1] Univ Bern, Dept Business Adm, Bern, Switzerland
[2] Univ Calif Berkeley, IEOR Dept, Berkeley, CA 94720 USA
Keywords
Big data; similarity-based machine learning; sparsification; sparse computation; computational geometry
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Several leading supervised and unsupervised machine learning algorithms require as input similarities between objects in a data set. Since the number of pairwise similarities grows quadratically with the size of the data set, it is computationally prohibitive to compute all pairwise similarities for large-scale data sets. The recently introduced methodology of "sparse computation" resolves this issue by computing only the relevant similarities instead of all pairwise similarities. To identify the relevant similarities, sparse computation efficiently projects the data onto a low-dimensional space where a similarity is considered relevant if the corresponding objects are close in this space. The relevant similarities are then computed in the original space. Sparse computation identifies close pairs by partitioning the low-dimensional space into grid blocks, and considering objects close if they fall in the same or adjacent grid blocks. This guarantees that all pairs of objects that are within a specified L-infinity distance are identified, as well as some pairs that are within twice this distance. For very large data sets, sparse computation can have high runtime due to the enumeration of pairs of adjacent blocks. We propose here new geometric algorithms that eliminate the need to enumerate adjacent blocks. Our empirical results on data sets with up to 10 million objects show that the new algorithms achieve a significant reduction in runtime. The algorithms have applications in large-scale computational geometry and (approximate) nearest neighbor search. Python implementations of the proposed algorithms are publicly available.
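The grid-block idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `close_pairs`, the parameter `eps` (the block side length, standing in for the specified L-infinity distance), and the sample points are hypothetical, and the sketch assumes the objects have already been projected onto a low-dimensional space. It enumerates adjacent blocks explicitly, which is the step the abstract identifies as the runtime bottleneck that the proposed algorithms avoid.

```python
from collections import defaultdict
from itertools import product


def close_pairs(points, eps):
    """Return index pairs (i, j) with i < j whose points lie in the same
    or in adjacent grid blocks of side length eps (hypothetical helper)."""
    dim = len(points[0])

    # Assign every point to the grid block that contains it.
    blocks = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(c // eps) for c in p)
        blocks[key].append(idx)

    # All 3^dim offsets: the block itself plus its adjacent blocks.
    offsets = list(product((-1, 0, 1), repeat=dim))

    pairs = set()
    for key, members in blocks.items():
        for off in offsets:
            neighbor = tuple(k + o for k, o in zip(key, off))
            # .get() avoids creating empty blocks in the defaultdict.
            for j in blocks.get(neighbor, ()):
                for i in members:
                    if i < j:
                        pairs.add((i, j))
    return pairs


# Hypothetical 2-D points, assumed to be already projected.
points = [(0.1, 0.1), (0.9, 0.9), (1.1, 1.1), (3.5, 3.5)]
pairs = close_pairs(points, eps=1.0)
```

With blocks of side `eps`, two points within L-infinity distance `eps` differ by at most one block index in every coordinate, so they always land in the same or adjacent blocks. Points in adjacent blocks can be almost `2 * eps` apart, which is why some pairs within twice the distance are also returned. In the sample, points 0 and 2 are reported even though they are farther apart than points 0 and 1.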
Pages: 546-555
Page count: 10
Related papers
50 in total
  • [21] High-Performance Computing based Scalable Online Fuzzy Clustering Algorithms for Big Data
    Jha, Preeti
    Tiwari, Aruna
    Bharill, Neha
    Ratnaparkhe, Milind
    Patel, Om Prakash
    Pulakitha, Rapolu
    Chauhan, Aditi
    [J]. 2022 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2022, : 1400 - 1407
  • [22] Transforming medical sciences with high-performance computing, high-performance data analytics and AI
    Lewandowski, Natalie
    Koller, Bastian
    [J]. TECHNOLOGY AND HEALTH CARE, 2023, 31 (04) : 1505 - 1507
  • [23] High performance deep learning techniques for big data analytics
    Li, Maozhen
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2018, 30 (23):
  • [24] Predictive Analytics on Genomic Data with High-Performance Computing
    Leung, Carson K.
    Sarumi, Oluwafemi A.
    Zhang, Christine Y.
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, 2020, : 2187 - 2194
  • [25] High-performance graph algorithms from parallel sparse matrices
    Gilbert, John R.
    Reinhardt, Steve
    Shah, Viral B.
    [J]. APPLIED PARALLEL COMPUTING: STATE OF THE ART IN SCIENTIFIC COMPUTING, 2007, 4699 : 260 - +
  • [26] Contributions to High-Performance Big Data Computing
    Fox, Geoffrey
    Qiu, Judy
    Crandall, David
    Von Laszewski, Gregor
    Beckstein, Oliver
    Paden, John
    Paraskevakos, Ioannis
    Jha, Shantenu
    Wang, Fusheng
    Marathe, Madhav
    Vullikanti, Anil
    Cheatham, Thomas
    [J]. FUTURE TRENDS OF HPC IN A DISRUPTIVE SCENARIO, 2019, 34 : 34 - 81
  • [27] High-Performance Computing for Big Data Processing
    Wu, Yulei
    Xiang, Yang
    Ge, Jingguo
    Muller, Peter
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2018, 88 : 693 - 695
  • [28] Advanced Computation of Sparse Precision Matrices for Big Data
    Baggag, Abdelkader
    Bensmail, Halima
    Srivastava, Jaideep
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2017, PT II, 2017, 10235 : 27 - 38
  • [29] Different Clustering Algorithms for Big Data Analytics: A Review
    Dave, Meenu
    Gianey, Hemant
    [J]. PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON SYSTEM MODELING & ADVANCEMENT IN RESEARCH TRENDS (SMART-2016), 2016, : 328 - 333
  • [30] Online learning algorithms for big data analytics: A survey
    Li, Zhijie
    Li, Yuanxiang
    Wang, Feng
    He, Guoliang
    Kuang, Li
    [J]. Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2015, 52 (08) : 1707 - 1721