High-Performance Geometric Algorithms for Sparse Computation in Big Data Analytics

Authors
Baumann, Philipp [1]
Hochbaum, Dorit S. [2]
Spaen, Quico [2]
Affiliations
[1] Univ Bern, Dept Business Adm, Bern, Switzerland
[2] Univ Calif Berkeley, IEOR Dept, Berkeley, CA 94720 USA
Keywords
Big data; similarity-based machine learning; sparsification; sparse computation; computational geometry
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Several leading supervised and unsupervised machine learning algorithms require as input similarities between objects in a data set. Since the number of pairwise similarities grows quadratically with the size of the data set, it is computationally prohibitive to compute all pairwise similarities for large-scale data sets. The recently introduced methodology of "sparse computation" resolves this issue by computing only the relevant similarities instead of all pairwise similarities. To identify the relevant similarities, sparse computation efficiently projects the data onto a low-dimensional space, where a similarity is considered relevant if the corresponding objects are close in this space. The relevant similarities are then computed in the original space. Sparse computation identifies close pairs by partitioning the low-dimensional space into grid blocks and considering objects close if they fall in the same or adjacent grid blocks. This guarantees that all pairs of objects within a specified L-infinity distance are identified, along with some pairs that are within twice this distance. For very large data sets, sparse computation can have a high runtime due to the enumeration of pairs of adjacent blocks. We propose here new geometric algorithms that eliminate the need to enumerate adjacent blocks. Our empirical results on data sets with up to 10 million objects show that the new algorithms achieve a significant reduction in runtime. The algorithms have applications in large-scale computational geometry and (approximate) nearest neighbor search. Python implementations of the proposed algorithms are publicly available.
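The grid-block step described in the abstract can be illustrated with a minimal sketch. It assumes the points have already been projected onto the low-dimensional space, and it shows only the baseline idea of enumerating adjacent blocks, not the new geometric algorithms the paper proposes. The function name `grid_block_pairs` and the parameter `block_width` are hypothetical and are not taken from the authors' published code.

```python
from collections import defaultdict
from itertools import product
from math import floor


def grid_block_pairs(points, block_width):
    """Return index pairs (i, j), i < j, whose points lie in the same
    or in adjacent grid blocks of side length `block_width`.

    Illustrative baseline only: it enumerates all 3^d neighboring
    blocks, which is the step the abstract says becomes expensive.
    """
    # Assign every point to a block identified by integer coordinates.
    blocks = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(floor(x / block_width) for x in point)
        blocks[key].append(idx)

    dim = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dim))

    pairs = set()
    for key, members in blocks.items():
        for offset in offsets:
            neighbor = tuple(k + o for k, o in zip(key, offset))
            others = blocks.get(neighbor)  # .get avoids creating empty blocks
            if others is None:
                continue
            for i in members:
                for j in others:
                    if i < j:
                        pairs.add((i, j))
    return pairs


# Four points in a 2-D projected space, block side length 0.1.
pts = [(0.05, 0.05), (0.12, 0.08), (0.29, 0.05), (0.90, 0.90)]
print(sorted(grid_block_pairs(pts, 0.1)))  # [(0, 1), (1, 2)]
```

In this toy input, points 0 and 1 are within L-infinity distance 0.1 and are reported. Points 1 and 2 are 0.17 apart (more than 0.1, less than 0.2) and are also reported because their blocks are adjacent, which matches the "some pairs within twice this distance" behavior. Points 0 and 2 are 0.24 apart and are not reported.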
Pages: 546-555
Number of pages: 10