Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

Cited by: 10
Authors
Wang, Yiqiu [1]
Shrivastava, Anshumali [1]
Wang, Jonathan [1]
Ryu, Junghee [1]
Affiliation
[1] Rice Univ, Houston, TX 77251 USA
Keywords
Similarity search; locality sensitive hashing; reservoir sampling; GPGPU
DOI
10.1145/3183713.3196925
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force ($n^2 D$ operations) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
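The combination of techniques named in the abstract can be illustrated with a small sketch. The code below is not the FLASH implementation: it is a minimal, CPU-only Python illustration, assuming integer feature ids, a toy universal hash of the form (a*x + b) mod p for classical minwise hashing (rather than the one-pass variant the abstract refers to), fixed-size buckets filled by reservoir sampling, and ranking of candidates purely by how often they collide with the query, so no similarity is ever computed. All names (`ReservoirLSH`, `minhash`, `make_hash_params`) and parameter defaults are hypothetical.

```python
import random
from collections import Counter

PRIME = 2_147_483_647  # Mersenne prime used by the toy universal hash


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash(features, params):
    """Return one minwise hash value per (a, b) pair for a set of feature ids."""
    return tuple(min((a * x + b) % PRIME for x in features) for a, b in params)


class ReservoirLSH:
    """Toy LSH index whose buckets are bounded by reservoir sampling."""

    def __init__(self, num_tables=8, hashes_per_table=2, reservoir_size=4, seed=0):
        self.k = hashes_per_table
        self.reservoir_size = reservoir_size
        self.params = make_hash_params(num_tables * hashes_per_table, seed)
        self.tables = [dict() for _ in range(num_tables)]
        self.rng = random.Random(seed + 1)

    def _keys(self, features):
        # Split the full signature into one k-tuple key per table.
        sig = minhash(features, self.params)
        return [sig[i * self.k:(i + 1) * self.k] for i in range(len(self.tables))]

    def add(self, item_id, features):
        for table, key in zip(self.tables, self._keys(features)):
            bucket = table.setdefault(key, {"seen": 0, "items": []})
            bucket["seen"] += 1
            if len(bucket["items"]) < self.reservoir_size:
                bucket["items"].append(item_id)
            else:
                # Reservoir sampling: keep each of the items seen so far
                # with equal probability reservoir_size / seen.
                j = self.rng.randrange(bucket["seen"])
                if j < self.reservoir_size:
                    bucket["items"][j] = item_id

    def query(self, features, top_k=1):
        # Count-based ranking: candidates are ordered by the number of
        # tables in which they share a bucket with the query.
        counts = Counter()
        for table, key in zip(self.tables, self._keys(features)):
            bucket = table.get(key)
            if bucket:
                counts.update(bucket["items"])
        return [item for item, _ in counts.most_common(top_k)]
```

A usage example under the same assumptions: after `index = ReservoirLSH()`, `index.add("a", {1, 2, 3, 4})` and `index.add("b", {100, 200, 300})`, calling `index.query({1, 2, 3, 4})` retrieves `"a"` from collision counts alone. Bounding each bucket keeps memory per table fixed regardless of how skewed the hash distribution is, which is the property that makes such a layout friendly to parallel hardware.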
Pages: 889-903
Page count: 15