On effective conceptual indexing and similarity search in text data

Cited by: 13
Authors
Aggarwal, CC [1 ]
Yu, PS [1 ]
Affiliation
[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
DOI
10.1109/ICDM.2001.989494
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Similarity search in text has proven to be an interesting problem from the qualitative perspective because of the inherent redundancies and ambiguities in textual descriptions. The methods that search engines use to retrieve the documents most similar to a user-defined set of keywords are not applicable to targets that are medium to large documents, because of even greater noise effects stemming from the large number of words unrelated to the overall topic of the document. The inverted representation is the dominant method for indexing text, but it is not as suitable for document-to-document similarity search as it is for short user queries. One way of improving the quality of similarity search is Latent Semantic Indexing (LSI), which maps the documents from the original set of words to a concept space. Unfortunately, LSI maps the data into a domain in which it is not possible to provide effective indexing techniques. In this paper, we investigate new ways of providing conceptual search among documents by creating a representation in terms of conceptual word-chains. This representation also supports effective indexing, so that similarity queries can be performed on large collections of documents by accessing a small amount of data. We demonstrate that our scheme outperforms standard textual similarity search on the inverted representation in terms of both quality and search efficiency.
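For readers unfamiliar with the baseline the abstract compares against, the following is a minimal sketch of document-to-document similarity search over an inverted representation. It is not the authors' word-chain method. The function names, the whitespace tokenizer, and the use of raw term frequencies with cosine scoring are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Build word -> [(doc_id, term_frequency), ...] postings (illustrative helper).

    Also returns the Euclidean norm of each document's term-frequency vector,
    which is needed for cosine scoring.
    """
    index = defaultdict(list)
    norms = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())  # assumed tokenizer: whitespace split
        for word, tf in counts.items():
            index[word].append((doc_id, tf))
        norms[doc_id] = math.sqrt(sum(tf * tf for tf in counts.values()))
    return index, norms


def most_similar(target_text, index, norms, top_k=3):
    """Rank indexed documents by cosine similarity to a whole target document."""
    counts = Counter(target_text.lower().split())
    target_norm = math.sqrt(sum(tf * tf for tf in counts.values()))
    scores = defaultdict(float)
    # Every distinct word in the target touches one postings list. A long
    # target document therefore reads a large part of the index.
    for word, tf in counts.items():
        for doc_id, doc_tf in index.get(word, []):
            scores[doc_id] += tf * doc_tf
    ranked = sorted(
        ((dot / (target_norm * norms[doc_id]), doc_id) for doc_id, dot in scores.items()),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in ranked[:top_k]]
```

With a short keyword query only a few postings lists are read. With a full document as the target, the loop visits one list per distinct word, including words unrelated to the topic, which is consistent with the noise and efficiency concerns the abstract raises.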
Pages: 3-10
Page count: 8