Efficient phrase-based document indexing for web document clustering

Cited by: 163
Authors
Hammouda, KM [1]
Kamel, MS [1]
Affiliations
[1] Univ Waterloo, Dept Syst Design Engn, Waterloo, ON N2L 3G1, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
Web mining; document similarity; phrase-based indexing; document clustering; document structure; document index graph; phrase matching;
DOI
10.1109/TKDE.2004.58
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
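The abstract describes the Document Index Graph only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes that words act as nodes, that each pair of consecutive words forms an edge, and that each edge records where it occurs (document, sentence, position). The class name `PhraseIndexGraph`, the method `add_document`, and the sample sentences are all hypothetical. Adding a document incrementally returns the phrases (runs of two or more consecutive words) it shares with previously indexed documents.

```python
from collections import defaultdict


class PhraseIndexGraph:
    """Simplified phrase index: edges between consecutive words.

    An illustrative assumption, not the paper's exact data structure.
    """

    def __init__(self):
        # (word, next_word) -> doc_id -> set of (sentence_idx, position)
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, sentences):
        """Index a document given as a list of tokenized sentences.

        Returns {other_doc_id: set of shared phrases (tuples of words)}.
        """
        matches = defaultdict(set)

        # Pass 1: match the new document against what is already indexed.
        for words in sentences:
            # (other_doc, other_sentence, other_position) -> start index
            # of the matching run inside the current sentence.
            active = {}
            for i in range(len(words) - 1):
                edge = (words[i], words[i + 1])
                extended = {}
                for other_doc, occurrences in self.edges.get(edge, {}).items():
                    if other_doc == doc_id:
                        continue
                    for sent_idx, pos in occurrences:
                        # Continue a run if the previous edge matched at the
                        # preceding position; otherwise start a new run here.
                        start = active.pop((other_doc, sent_idx, pos - 1), i)
                        extended[(other_doc, sent_idx, pos)] = start
                # Runs left in `active` were not extended: they end at word i.
                for (other_doc, _, _), start in active.items():
                    matches[other_doc].add(tuple(words[start:i + 1]))
                active = extended
            # Runs still open at the end of the sentence end at its last word.
            for (other_doc, _, _), start in active.items():
                matches[other_doc].add(tuple(words[start:]))

        # Pass 2: add the new document's edges to the index.
        for sent_idx, words in enumerate(sentences):
            for i in range(len(words) - 1):
                self.edges[(words[i], words[i + 1])][doc_id].add((sent_idx, i))

        return dict(matches)


graph = PhraseIndexGraph()
first = graph.add_document("d1", [
    ["river", "rafting"],
    ["mild", "river", "rafting"],
    ["river", "rafting", "trips"],
])
second = graph.add_document("d2", [
    ["wild", "river", "adventures"],
    ["river", "rafting", "vacation", "plan"],
])
```

Here `first` is empty because nothing was indexed before `d1`, while `second` maps `d1` to the single shared phrase `("river", "rafting")`. A similarity measure could then weight shared phrases by length and frequency. Dropping the position information and keeping only per-word counts would reduce this to a single-term index, in the spirit of the vector space fallback the abstract mentions.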
Pages: 1279-1296
Page count: 18
Related papers
50 in total
  • [1] Efficient phrase-based document similarity for clustering
    Chim, Hung
    Deng, Xiaotie
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2008, 20 (09) : 1217 - 1229
  • [2] Efficient Incremental Phrase-Based Document Clustering
    Bakr, Ahmad M.
    Yousri, Noha A.
    Ismail, Mohamed A.
    [J]. 2012 21ST INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR 2012), 2012, : 517 - 520
  • [3] Phrase Based Web Document Clustering: An Indexing Approach
    Singh, Amit Prakash
    Srivastava, Shalini
    Sahu, Sanjib Kumar
    [J]. COMPUTER COMMUNICATION, NETWORKING AND INTERNET SECURITY, 2017, 5 : 481 - 492
  • [4] Hybrid distance based document clustering with keyword and phrase indexing
    Subhadra, K.
    Shashi, M.
    [J]. International Journal of Computer Science Issues, 2012, 9 (02) : 345 - 350
  • [5] Document Classification Efficiency of Phrase-Based Techniques
    Kapalavayi, Nagesh
    Murthy, S. N. Jayaram
    Hu, Gongzhu
    [J]. 2009 IEEE/ACS INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, VOLS 1 AND 2, 2009, : 174 - 178
  • [6] Phrase-based document similarity based on an Index Graph model
    Hammouda, KM
    Kamel, MS
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2002, : 203 - 210
  • [7] A Well Organized Phrase-Based Document Clustering Using ASCII Values and Adjacency List
    Lukka, Srikanth
    Shaik, Rizwana
    [J]. PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR 2016), 2018, 614 : 113 - 120
  • [8] Semantic key phrase-based model for document management
    Bafna, Prafulla
    Pramod, Dhanya
    Shrwaikar, Shailaja
    Hassan, Atiya
    [J]. BENCHMARKING-AN INTERNATIONAL JOURNAL, 2019, 26 (06) : 1709 - 1727
  • [9] Phrase-based hierarchical clustering of web search results
    Maslowska, I
    [J]. ADVANCES IN INFORMATION RETRIEVAL, 2003, 2633 : 555 - 562
  • [10] A Phrase-Based Method for Hierarchical Clustering of Web Snippets
    Li, Zhao
    Wu, Xindong
    [J]. PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10), 2010, : 1947 - 1948