A New Retrieval Model Based on TextTiling for Document Similarity Search

Cited by: 0
Authors
Xiao-Jun Wan
Yu-Xin Peng
Affiliation
[1] Peking University, National Key Laboratory of Text Processing Technology, Institute of Computer Science and Technology
Keywords
document similarity search; retrieval model; similarity measure; TextTiling; optimal matching;
DOI
Not available
Abstract
Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including the Okapi BM25 model and the Smart vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. In this paper, the performance of these models is compared quantitatively through experiments. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed that takes the subtopic structures of documents into account. The model first splits the documents into text segments with TextTiling, then calculates the similarities for different pairs of text segments in the documents, and finally combines these pairwise similarities with an optimal matching method to obtain the overall similarity between the documents. Experimental results show that: 1) the popular retrieval models (the Okapi BM25 model and the Smart vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are validated as appropriate.
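The three-step pipeline described in the abstract (segment, compare segment pairs, combine by optimal matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: blank-line splitting stands in for TextTiling, raw term-frequency vectors are used for the Cosine measure, the optimal matching is found by brute force over small segment sets, and the total matching weight is normalized by the larger segment count. The function names `cosine`, `segment`, and `optimal_matching_similarity` are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(doc: str) -> list:
    """Assumption: blank-line splitting as a stand-in for TextTiling."""
    blocks = [b for b in doc.split("\n\n") if b.strip()]
    return [Counter(b.lower().split()) for b in blocks]


def optimal_matching_similarity(doc_a: str, doc_b: str) -> float:
    """Combine pairwise segment similarities via a maximum-weight matching."""
    segs_a, segs_b = segment(doc_a), segment(doc_b)
    if not segs_a or not segs_b:
        return 0.0
    # Put the document with fewer segments first so each of its
    # segments is matched to a distinct segment of the other document.
    short, long_ = sorted((segs_a, segs_b), key=len)
    sim = [[cosine(s, t) for t in long_] for s in short]
    # Brute-force search over one-to-one assignments (small inputs only).
    best = max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(long_)), len(short))
    )
    # Assumption: normalize by the larger number of segments.
    return best / len(long_)
```

The brute-force search grows factorially with the number of segments, so for realistic documents a polynomial-time assignment solver such as the Hungarian algorithm would be the more practical choice.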
Pages: 552-558
Page count: 6
Published in: Journal of Computer Science and Technology, 2005, 20 (04): 552-558