On effective conceptual indexing and similarity search in text data

Cited by: 13
Authors
Aggarwal, CC [1]
Yu, PS [1]
Affiliation
[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
DOI
10.1109/ICDM.2001.989494
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Similarity search in text has proven to be an interesting problem from the qualitative perspective because of inherent redundancies and ambiguities in textual descriptions. The methods used in search engines to retrieve the documents most similar to user-defined sets of keywords are not applicable to targets which are medium to large size documents, because of even greater noise effects stemming from the presence of a large number of words unrelated to the overall topic of the document. The inverted representation is the dominant method for indexing text, but it is not as suitable for document-to-document similarity search as it is for short user queries. One way of improving the quality of similarity search is Latent Semantic Indexing (LSI), which maps the documents from the original set of words to a concept space. Unfortunately, LSI maps the data into a domain in which it is not possible to provide effective indexing techniques. In this paper, we investigate new ways of providing conceptual search among documents by creating a representation in terms of conceptual word-chains. This technique also allows effective indexing techniques, so that similarity queries can be performed on large collections of documents by accessing a small amount of data. We demonstrate that our scheme outperforms standard textual similarity search on the inverted representation both in terms of quality and search efficiency.
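The baseline named in the abstract, document-to-document similarity search on the inverted representation, can be illustrated with a minimal sketch. This is not the authors' word-chain method, which the abstract does not specify; the function names, the whitespace tokenizer, the raw term-frequency weights, and the three sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_id: term frequency}; also return document norms."""
    index = defaultdict(dict)
    norms = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())  # assumed tokenizer: whitespace
        for word, tf in counts.items():
            index[word][doc_id] = tf
        norms[doc_id] = math.sqrt(sum(tf * tf for tf in counts.values()))
    return index, norms


def most_similar(target_text, index, norms, k=3):
    """Rank indexed documents by cosine similarity to a target document."""
    target = Counter(target_text.lower().split())
    target_norm = math.sqrt(sum(tf * tf for tf in target.values()))
    scores = defaultdict(float)
    # One posting list is read per distinct word in the target, so a long
    # target document touches a large part of the index.
    for word, tf in target.items():
        for doc_id, doc_tf in index.get(word, {}).items():
            scores[doc_id] += tf * doc_tf
    ranked = sorted(
        ((dot / (target_norm * norms[d]), d) for d, dot in scores.items()),
        reverse=True,
    )
    return [(d, round(s, 3)) for s, d in ranked[:k]]


# Hypothetical sample collection, not data from the paper.
docs = {
    "d1": "latent semantic indexing maps documents to a concept space",
    "d2": "inverted index for keyword search in text documents",
    "d3": "protein fragment similarity search",
}
index, norms = build_inverted_index(docs)
results = most_similar("similarity search in text documents", index, norms)
```

In this sketch the number of posting lists read grows with the number of distinct words in the target, which is one way to see why the abstract describes the inverted representation as better suited to short queries than to whole-document targets.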
Pages: 3-10
Page count: 8