Token list based information search in a multi-dimensional massive database

Cited by: 0
Authors
Shen, Haiying [1 ]
Li, Ze [2 ]
Li, Ting [3 ]
Affiliations
[1] Clemson Univ, Dept Elect & Comp Engn, Clemson, SC 29634 USA
[2] MicroStrategy, Tysons Corner, Fairfax, VA 22182 USA
[3] Wal Mart Stores Inc, Bentonville, AR 72716 USA
Keywords
Similarity data search; Proximity search; Locality sensitive hash; Database; SIMILARITY SEARCH; SPACES;
DOI
10.1007/s10844-013-0289-9
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Finding proximity information is crucial for massive database search. Locality Sensitive Hashing (LSH) is a method for finding nearest neighbors of a query point in a high-dimensional space. It classifies high-dimensional data according to data similarity. However, the "curse of dimensionality" makes LSH insufficiently effective in finding similar data and insufficiently efficient in terms of memory resources and search delays. The contribution of this work is threefold. First, we study a Token List based information Search scheme (TLS) as an alternative to LSH. TLS builds a token list table containing all the unique tokens from the database, and clusters data records having the same token together in one group. Querying is conducted in a small number of groups of relevant data records instead of searching the entire database. Second, in order to decrease the searching time of the token list, we further propose the Optimized Token list based Search schemes (OTS) based on index-tree and hash table structures. An index-tree structure orders the tokens in the token list and constructs an index table based on the tokens. Searching the token list starts from the entry of the token list supplied by the index table. A hash table structure assigns a hash ID to each token. A query token can be directly located in the token list according to its hash ID. Third, since a single-token based method leads to high overhead in the results refinement given a required similarity, we further investigate how a Multi-Token List Search scheme (MTLS) improves the performance of database proximity search. We conducted experiments on the LSH-based searching scheme, TLS, OTS, and MTLS using a massive customer data integration database. The comparison experimental results show that TLS is more efficient than an LSH-based searching scheme, and OTS improves the search efficiency of TLS. 
Further, MTLS performs better than TLS when the number of tokens is appropriately chosen, and a two-token adjacent token list achieves the shortest query delay in our testing dataset.
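As a rough illustration of the idea described in the abstract, the following Python sketch groups records by shared tokens and searches only the matching groups. It is not the authors' implementation: the whitespace tokenizer, the Jaccard similarity used for refinement, the function names, and the sample records are all assumptions made for this example. A Python dict stands in for the hash table structure, and `n=2` stands in for a two-token adjacent token list.

```python
from collections import defaultdict


def tokenize(record):
    """Split a record into lowercase whitespace-separated tokens (an assumption)."""
    return record.lower().split()


def build_token_list(records, n=1):
    """Map each unique run of n adjacent tokens to the ids of records containing it."""
    table = defaultdict(set)
    for rec_id, record in enumerate(records):
        tokens = tokenize(record)
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n])].add(rec_id)
    return table


def candidates(table, query, n=1):
    """Collect ids of records that share at least one n-token key with the query."""
    tokens = tokenize(query)
    found = set()
    for i in range(len(tokens) - n + 1):
        # .get avoids inserting empty groups into the defaultdict
        found |= table.get(tuple(tokens[i:i + n]), set())
    return found


def jaccard(a, b):
    """Token-set Jaccard similarity (a stand-in; the abstract names no measure)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def search(records, table, query, threshold, n=1):
    """Refine the candidate groups by the required similarity threshold."""
    return sorted(
        rec_id for rec_id in candidates(table, query, n)
        if jaccard(records[rec_id], query) >= threshold
    )


# Hypothetical customer-style records, invented for this example.
records = [
    "john smith 12 main street",
    "jon smith 12 main street",
    "mary jones 40 oak avenue",
    "john doe 7 oak avenue",
]
single = build_token_list(records, n=1)  # single-token list
pairs = build_token_list(records, n=2)   # two-token adjacent list
```

In this toy data, the two-token list returns fewer candidates for the same query than the single-token list, so fewer records need to go through the similarity refinement step. This is consistent with the trade-off the abstract describes, but it says nothing about the paper's measured results.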
Pages: 567-594
Number of pages: 28
Related papers
50 items in total
  • [41] Multi-dimensional modeling for manufacturing process information
    Lu, Sheng-Ping
    Qiao, Li-Hong
    Zhang, Jin
    Jisuanji Jicheng Zhizao Xitong/Computer Integrated Manufacturing Systems, CIMS, 2010, 16 (12): : 2577 - 2582
  • [42] Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain Databases
    Zhang, Peiwu
    Cheng, Reynold
    Mamoulis, Nikos
    Renz, Matthias
    Zuefle, Andreas
    Tang, Yu
    Emrich, Tobias
    2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2013, : 158 - 169
  • [43] Multi-dimensional prefix matching using line search
    Waldvogel, M
    25TH ANNUAL IEEE CONFERENCE ON LOCAL COMPUTER NETWORKS - PROCEEDINGS, 2000, : 200 - 207
  • [44] EFFICIENT SIMILARITY SEARCH FOR MULTI-DIMENSIONAL TIME SEQUENCES
    Lee, Sangjun
    Park, Jisook
    INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2010, 8 (03) : 343 - 357
  • [45] Fuzzy multi-dimensional search in the Wayfinder file system
    Peery, Christopher
    Wang, Wei
    Marian, Amelie
    Nguyen, Thu D.
    2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 1588 - +
  • [46] Nearest Keyword Set Search in Multi-Dimensional Datasets
    Singh, Vishwakarma
    Zong, Bo
    Singh, Ambuj K.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28 (03) : 741 - 755
  • [47] An Approach to Nearest Neighboring Search for Multi-dimensional Data
    Shi, Yong
    Zhang, Li
    Zhu, Lei
    INTERNATIONAL JOURNAL OF FUTURE GENERATION COMMUNICATION AND NETWORKING, 2011, 4 (01): : 23 - 37
  • [48] Multi-dimensional Modeling of Massive Binary Interaction in Eta Carinae
    Groh, J. H.
    FROM INTERACTING BINARIES TO EXOPLANETS: ESSENTIAL MODELING TOOLS, 2012, (282): : 259 - 260
  • [49] Exploiting Massive Parallelism for Indexing Multi-Dimensional Datasets on the GPU
    Kim, Jinwoong
    Jeong, Won-Ki
    Nam, Beomseok
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2015, 26 (08) : 2258 - 2271
  • [50] VirtualSpectrum, a tool for simulating peak list for multi-dimensional NMR spectra
    Nielsen, Jakob Toudahl
    Nielsen, Niels Chr.
    JOURNAL OF BIOMOLECULAR NMR, 2014, 60 (01) : 51 - 66