UTILIZING TERM PROXIMITY BASED FEATURES TO IMPROVE TEXT DOCUMENT CLUSTERING

Cited by: 0
Authors
Paliwal, Shashank [1]
Pudi, Vikram [1]
Affiliations
[1] Int Inst Informat Technol Hyderabad, Ctr Data Engn, Hyderabad, Andhra Pradesh, India
Keywords
Text document clustering; Document similarity; Term proximity; Term dependency; Feature weighting;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents using the simple Bag-of-Words (BOW) model, which assumes that the terms of a text document are independent of each other. Such single-term analysis of the text completely ignores the underlying (semantic) structure of a document. In the literature, considerable effort has been made to enrich the BOW representation using phrases and n-grams such as bi-grams and tri-grams. These approaches take into account dependency only between adjacent terms or within a continuous sequence of terms. However, while some dependencies exist between adjacent words, others span a greater distance. In this paper, we enrich the traditional document vector by adding the notion of term-pair features. A term-pair feature is a pair of terms from the same document that may be adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology to select potential term-pairs from a given document. Utilizing term proximity between distant terms also allows some flexibility for two documents to be considered similar if they are about similar topics but written in varied styles. Experimental results on a standard web document data set show that clustering performance is substantially improved by adding term-pair features.
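The idea of term-pair features can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not describe how term-pairs are selected or weighted, so the fixed co-occurrence window, the raw-count weighting, and the function names `term_pair_features`, `enriched_vector`, and `cosine` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def term_pair_features(tokens, window=3):
    """Count unordered pairs of distinct terms that occur within
    `window` positions of each other (window size is an assumption)."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` tokens; adjacent terms are included.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sorting makes the pair independent of word order.
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def enriched_vector(tokens, window=3):
    """Bag-of-Words counts plus term-pair counts in one sparse vector."""
    vector = Counter(tokens)                            # single-term (BOW) features
    vector.update(term_pair_features(tokens, window))   # term-pair features
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    doc_a = "term proximity improves document clustering".split()
    doc_b = "document clustering improves with term proximity".split()
    print(cosine(enriched_vector(doc_a), enriched_vector(doc_b)))
```

Because the pairs are unordered and may span several positions, two documents that mention the same terms close together but in a different order still share pair features, which is the flexibility toward varied writing styles that the abstract mentions.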
Pages: 537-544
Page count: 8
Related papers
50 in total
  • [31] UTILIZING IMAGE-BASED FEATURES IN BIOMEDICAL DOCUMENT CLASSIFICATION
    Ma, Kaidi
    Jeong, Hogyeong
    Rohith, M., V
    Somanath, Gowri
    Tarpine, Ryan
    Schutter, Kyle
    Blostein, Dorothea
    Istrail, Sorin
    Kambhamettu, Chandra
    Shatkay, Hagit
    2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015, : 4451 - 4455
  • [32] Performance Evaluation of Semantic Based and Ontology Based Text Document Clustering Techniques
    Punitha, S. C.
    Punithavalli, M.
    INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY AND SYSTEM DESIGN 2011, 2012, 30 : 100 - 106
  • [33] Exploiting category information and document information to improve term weighting for text categorization
    Li, Jingyang
    Sun, Maosong
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2007, 4394 : 587 - +
  • [34] A Weighted Topical Document Embedding based Clustering Method for News Text
    Zhu Dechao
    Song Hui
    2016 IEEE INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC), 2016, : 1060 - 1065
  • [35] Adaptive Centroid-based Clustering Algorithm for Text Document Data
    Li, Ximing
    Ouyang, Jihong
    Zhou, Xiaotang
    Fu, Bo
    2014 SIXTH INTERNATIONAL SYMPOSIUM ON PARALLEL ARCHITECTURES, ALGORITHMS AND PROGRAMMING (PAAP), 2014, : 63 - 68
  • [36] Evaluation of text document clustering approach based on particle swarm optimization
    Karol, Stuti
    Mangat, Veenu
    OPEN COMPUTER SCIENCE, 2013, 3 (02): : 69 - 90
  • [37] Evaluation of a Text Document Clustering Approach based on Particle Swarm Optimization
    Karol, Stuti
    Mangat, Veenu
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2013, 13 (07): : 130 - 143
  • [38] Text Document Clustering: The Application of Cluster Analysis to Textual Document
    2016, Institute of Electrical and Electronics Engineers Inc., United States
  • [39] Text Document Clustering: The Application of Cluster Analysis to Textual Document
    Reddy, Venkata Srikanth
    Kinnicutt, Patrick
    Lee, Roger
    2016 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE & COMPUTATIONAL INTELLIGENCE (CSCI), 2016, : 1174 - 1179
  • [40] Multi-type features based Web document clustering
    Huang, S
    Xue, GR
    Zhang, BY
    Chen, Z
    Yu, Y
    Ma, WY
    WEB INFORMATION SYSTEMS - WISE 2004, PROCEEDINGS, 2004, 3306 : 253 - 265