UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING

Cited by: 0
Authors
Paliwal, Shashank [1]
Pudi, Vikram [1]
Affiliations
[1] Int Inst Informat Technol Hyderabad, Ctr Data Engn, Hyderabad, Andhra Pradesh, India
Keywords
Text document clustering; Document similarity; Term proximity; Term dependency; Feature weighting;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents with the simple Bag-of-Words (BOW) model, which assumes that the terms of a document are independent of each other. Such single-term analysis completely ignores the underlying (semantic) structure of a document. In the literature, considerable effort has been made to enrich the BOW representation using phrases and n-grams such as bi-grams and tri-grams. These approaches take into account dependencies only between adjacent terms or within a continuous sequence of terms. However, while some dependencies exist between adjacent words, others span a greater distance. In this paper, we enrich the traditional document vector by adding the notion of term-pair features. A term-pair feature is a pair of terms from the same document that may be adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology to select potential term-pairs from a given document. Utilizing term proximity between distant terms also allows two documents to be judged similar if they are about similar topics but written in different styles. Experimental results on a standard web document data set show that clustering performance is substantially improved by adding term-pair features.
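The abstract does not spell out how term-pairs are selected or weighted, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a fixed positional window (the hypothetical `window` parameter), raw counts as weights, and whitespace tokenization; the function names `term_pair_features`, `document_vector`, and `cosine` are illustrative.

```python
from collections import Counter
from math import sqrt


def term_pair_features(tokens, window=3):
    """Count unordered pairs of distinct terms at most `window` positions apart.

    Assumption: a fixed window stands in for the paper's (unspecified)
    term-pair selection step.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at the next `window` tokens only.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort so that (a, b) and (b, a) are the same feature.
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def document_vector(text, window=3):
    """Bag-of-Words counts enriched with term-pair counts."""
    tokens = text.lower().split()
    vector = Counter(tokens)  # single-term (BOW) features
    vector.update(term_pair_features(tokens, window))  # term-pair features
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

With this sketch, "text clustering" and "clustering of text" share the pair feature `("clustering", "text")` even though they share no bi-gram, which is the kind of flexibility across writing styles that the abstract describes.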
Pages: 537-544
Number of pages: 8