On effective conceptual indexing and similarity search in text data

Cited by: 13
Authors
Aggarwal, CC [1]
Yu, PS [1]
Affiliations
[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
DOI
10.1109/ICDM.2001.989494
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Similarity search in text has proven to be an interesting problem from the qualitative perspective because of inherent redundancies and ambiguities in textual descriptions. The methods used in search engines to retrieve the documents most similar to user-defined sets of keywords are not applicable to targets that are medium- to large-sized documents, because of even greater noise effects stemming from the presence of a large number of words unrelated to the overall topic of the document. The inverted representation is the dominant method for indexing text, but it is not as suitable for document-to-document similarity search as it is for short user queries. One way of improving the quality of similarity search is Latent Semantic Indexing (LSI), which maps the documents from the original set of words to a concept space. Unfortunately, LSI maps the data into a domain in which it is not possible to provide effective indexing techniques. In this paper, we investigate new ways of providing conceptual search among documents by creating a representation in terms of conceptual word-chains. This technique also allows effective indexing, so that similarity queries can be performed on large collections of documents by accessing a small amount of data. We demonstrate that our scheme outperforms standard textual similarity search on the inverted representation in terms of both quality and search efficiency.
Pages: 3-10
Page count: 8
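
The abstract uses document-to-document similarity search on the inverted representation as its baseline. As a rough illustration only, the sketch below shows what such a baseline might look like: a TF-IDF inverted index queried with a whole document as the target. It is not the paper's conceptual word-chain method. The function names (`build_index`, `most_similar`), the whitespace tokenizer, the particular IDF formula, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return (inverted index, unit-length TF-IDF vectors) for a list of texts.

    Assumption: tokens are lowercase whitespace-separated words, and the
    IDF weight is log(1 + N / df); both are illustrative choices.
    """
    counts = [Counter(text.lower().split()) for text in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    index = defaultdict(list)   # term -> posting list of (doc_id, weight)
    vectors = []                # doc_id -> {term: normalized weight}
    for doc_id, c in enumerate(counts):
        weights = {t: tf * math.log(1 + n_docs / doc_freq[t]) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vec = {t: w / norm for t, w in weights.items()}
        vectors.append(vec)
        for t, w in vec.items():
            index[t].append((doc_id, w))
    return index, vectors


def most_similar(target_id, index, vectors, k=3):
    """Rank other documents by cosine similarity to the target document.

    Every term of the target opens one posting list, so a long target
    document touches many lists, including those of off-topic words.
    """
    scores = defaultdict(float)
    for term, w_target in vectors[target_id].items():
        for doc_id, w_doc in index[term]:
            if doc_id != target_id:
                scores[doc_id] += w_target * w_doc
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical toy collection, invented for this example.
docs = [
    "latent semantic indexing maps documents to a concept space",
    "concept space indexing for documents",
    "stock prices fell sharply on monday",
]
index, vectors = build_index(docs)
result = most_similar(0, index, vectors)
print(result)
```

In this sketch the cost of a query grows with the number of distinct terms in the target document, which is one way to read the abstract's remark that the inverted representation suits short user queries better than document-sized targets.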