Token list based information search in a multi-dimensional massive database

Authors
Shen, Haiying [1]
Li, Ze [2]
Li, Ting [3]
Affiliations
[1] Clemson Univ, Dept Elect & Comp Engn, Clemson, SC 29634 USA
[2] MicroStrategy, Tysons Corner, Fairfax, VA 22182 USA
[3] Wal Mart Stores Inc, Bentonville, AR 72716 USA
Keywords
Similarity data search; Proximity search; Locality sensitive hash; Database; SIMILARITY SEARCH; SPACES;
DOI
10.1007/s10844-013-0289-9
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Finding proximity information is crucial for massive database search. Locality Sensitive Hashing (LSH) is a method for finding nearest neighbors of a query point in a high-dimensional space. It classifies high-dimensional data according to data similarity. However, the "curse of dimensionality" makes LSH insufficiently effective in finding similar data and insufficiently efficient in terms of memory resources and search delays. The contribution of this work is threefold. First, we study a Token List based information Search scheme (TLS) as an alternative to LSH. TLS builds a token list table containing all the unique tokens from the database, and clusters data records having the same token together in one group. Querying is conducted in a small number of groups of relevant data records instead of searching the entire database. Second, in order to decrease the searching time of the token list, we further propose the Optimized Token list based Search schemes (OTS) based on index-tree and hash table structures. An index-tree structure orders the tokens in the token list and constructs an index table based on the tokens. Searching the token list starts from the entry of the token list supplied by the index table. A hash table structure assigns a hash ID to each token. A query token can be directly located in the token list according to its hash ID. Third, since a single-token based method leads to high overhead in the results refinement given a required similarity, we further investigate how a Multi-Token List Search scheme (MTLS) improves the performance of database proximity search. We conducted experiments on the LSH-based searching scheme, TLS, OTS, and MTLS using a massive customer data integration database. The comparison experimental results show that TLS is more efficient than an LSH-based searching scheme, and OTS improves the search efficiency of TLS. Further, MTLS performs better than TLS when the number of tokens is appropriately chosen, and a two-token adjacent token list achieves the shortest query delay in our testing dataset.
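The token-list idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Whitespace tokenization, Jaccard similarity for the refinement step, and all function names (`build_index`, `single_keys`, `adjacent_pair_keys`, `candidates`, `search`) are assumptions introduced here for illustration. A Python dict stands in for the hash-table lookup, and the sample records are made up.

```python
from collections import defaultdict


def tokenize(record):
    # Assumption: tokens are lowercase, whitespace-separated words.
    return record.lower().split()


def single_keys(tokens):
    # Single-token keys, in the spirit of TLS.
    return set(tokens)


def adjacent_pair_keys(tokens):
    # Keys made of two adjacent tokens, in the spirit of a two-token list.
    return set(zip(tokens, tokens[1:]))


def build_index(records, key_fn):
    # Map each key to the group of record ids that contain it.
    # The dict gives direct (hashed) access to a key's group.
    index = defaultdict(set)
    for rid, record in enumerate(records):
        for key in key_fn(tokenize(record)):
            index[key].add(rid)
    return index


def candidates(query, index, key_fn):
    # Union of the groups that share at least one key with the query.
    found = set()
    for key in key_fn(tokenize(query)):
        found |= index.get(key, set())
    return found


def jaccard(a, b):
    # Assumption: similarity is Jaccard overlap of the token sets.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query, records, index, key_fn, threshold):
    # Refine only the candidate groups instead of scanning every record.
    q = tokenize(query)
    return sorted(
        rid
        for rid in candidates(query, index, key_fn)
        if jaccard(q, tokenize(records[rid])) >= threshold
    )


# Hypothetical sample data, not taken from the paper.
records = [
    "Ann Smith Clemson SC",
    "Ann Smyth Clemson SC",
    "Bob Jones Fairfax VA",
]
single_index = build_index(records, single_keys)
pair_index = build_index(records, adjacent_pair_keys)
query = "ann smith clemson"
print(candidates(query, single_index, single_keys))
print(candidates(query, pair_index, adjacent_pair_keys))
print(search(query, records, pair_index, adjacent_pair_keys, 0.5))
```

On this toy data the single-token lookup returns two candidate records to refine, while the adjacent-pair lookup returns one, which mirrors the abstract's point that multi-token keys can reduce refinement overhead. Whether that trade-off holds on real data depends on how many tokens are combined.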
Pages: 567-594
Page count: 28
Source
Journal of Intelligent Information Systems, 2014, 42: 567-594