UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING

Cited by: 0
Authors
Paliwal, Shashank [1]
Pudi, Vikram [1]
Affiliations
[1] Int Inst Informat Technol Hyderabad, Ctr Data Engn, Hyderabad, Andhra Pradesh, India
Keywords
Text document clustering; Document similarity; Term proximity; Term dependency; Feature weighting;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents with the simple Bag-of-Words (BOW) model, which assumes that the terms of a document are independent of each other. Such single-term analysis completely ignores the underlying (semantic) structure of a document. In the literature, considerable effort has been made to enrich the BOW representation using phrases and n-grams such as bi-grams and tri-grams. These approaches take into account dependencies only between adjacent terms or within a continuous sequence of terms. However, while some dependencies exist between adjacent words, others span greater distances. In this paper, we enrich the traditional document vector by adding the notion of term-pair features. A term-pair feature is a pair of terms from the same document that may be adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology for selecting potential term-pairs from a given document. Utilizing term proximity between distant terms also allows two documents to be judged similar if they cover similar topics but differ in writing style. Experimental results on a standard web document data set show that clustering performance is substantially improved by adding term-pair features.
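The idea of term-pair features can be illustrated with a minimal Python sketch. The abstract does not give the paper's selection procedure or weighting scheme, so everything here is an assumption made for illustration: pairs are formed from terms that co-occur within a fixed `window` of positions, features are weighted by raw counts, and similarity is plain cosine. The names `term_pair_features`, `document_vector`, `cosine`, and the parameter `window` are hypothetical and not taken from the paper.

```python
from collections import Counter
from math import sqrt


def term_pair_features(tokens, window=3):
    """Count unordered term pairs whose positions differ by at most `window`.

    Assumption: a fixed positional window stands in for the paper's
    (unspecified here) term-pair selection method.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look ahead up to `window` positions, so pairs may be adjacent or distant.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort so that (a, b) and (b, a) map to the same feature.
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def document_vector(tokens, window=3):
    """Sparse vector holding bag-of-words counts plus term-pair counts."""
    vec = Counter(tokens)
    vec.update(term_pair_features(tokens, window))
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    doc1 = "text document clustering uses term proximity".split()
    doc2 = "clustering of text document by proximity of term".split()
    print(cosine(document_vector(doc1), document_vector(doc2)))
```

Because pairs are unordered and may span several positions, two documents that mention the same terms in a different order or with intervening words can still share pair features, which is the flexibility toward varied writing styles that the abstract describes.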
Pages: 537-544
Number of pages: 8