On effective conceptual indexing and similarity search in text data

Cited by: 13
Authors
Aggarwal, CC [1 ]
Yu, PS [1 ]
Affiliation
[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
DOI
10.1109/ICDM.2001.989494
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Similarity search in text has proven to be an interesting problem from the qualitative perspective because of inherent redundancies and ambiguities in textual descriptions. The methods that search engines use to retrieve the documents most similar to user-defined sets of keywords are not applicable when the targets are medium- to large-sized documents, because of even greater noise effects stemming from the presence of a large number of words unrelated to the overall topic of the document. The inverted representation is the dominant method for indexing text, but it is less suitable for document-to-document similarity search than for short user queries. One way of improving the quality of similarity search is Latent Semantic Indexing (LSI), which maps the documents from the original set of words to a concept space. Unfortunately, LSI maps the data into a domain in which it is not possible to provide effective indexing techniques. In this paper, we investigate new ways of providing conceptual search among documents by creating a representation in terms of conceptual word-chains. This technique also allows effective indexing, so that similarity queries can be performed on large collections of documents by accessing only a small amount of data. We demonstrate that our scheme outperforms standard textual similarity search on the inverted representation in terms of both quality and search efficiency.
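To make the abstract's point about the inverted representation concrete, the following is a minimal sketch of the baseline only, not of the paper's conceptual word-chain method. The function names `build_inverted_index` and `candidates` and the three toy documents are hypothetical and chosen purely for illustration. It shows that a short keyword query reads few posting lists, while using a whole document as the query reads one posting list per distinct word.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def candidates(index, query_text):
    """Return the ids of documents sharing at least one word with the query,
    together with the number of posting lists that were read."""
    words = set(query_text.lower().split())
    hits = set()
    lists_read = 0
    for word in words:
        if word in index:
            lists_read += 1
            hits |= index[word]
    return hits, lists_read


# Hypothetical toy collection (illustrative data, not from the paper).
docs = {
    "d1": "stock markets fell as investors sold shares",
    "d2": "the team won the match in the final minute",
    "d3": "investors bought shares as markets rose",
}
index = build_inverted_index(docs)

# A short keyword query touches one posting list per keyword.
short_hits, short_reads = candidates(index, "markets shares")

# A document used as the query touches a posting list for every distinct
# word it contains, including words unrelated to its topic (e.g. "as").
long_hits, long_reads = candidates(index, docs["d1"])

print(sorted(short_hits), short_reads)
print(sorted(long_hits), long_reads)
```

In this toy collection both queries return the same candidate documents, but the document-sized query reads several times as many posting lists, which is the access cost the abstract attributes to document-to-document search on the inverted representation.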
Pages: 3-10
Page count: 8