Semantic Oriented Document Clustering Using Distribution Semantics

Cited by: 1
Authors
Khan, Umar Ali [1]
Rafi, Muhammad [1]
Affiliations
[1] Natl Univ & Emerging Sci, Karachi, Pakistan
Keywords
Document clustering; distributional semantics; hierarchical agglomerative clustering (HAC)
DOI
10.1145/3206098.3206110
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The exponential growth of textual documents in electronic form, in both public and proprietary storage, forces researchers to find ways to efficiently extract meaningful, actionable information from these documents. Document clustering has found its niche in this area. This paper proposes a document representation model based on distributional semantics. The law of distributional semantics states that linguistic terms that appear with similar distributions in a language corpus generally have similar meanings. The proposed document model uses only those terms (linguistic features) that have the same distribution over a given collection of documents. To build it, the distributional terms are first identified using distributional criteria, and the documents are then represented by these distributional terms only. A novel similarity measure is proposed over these documents that also utilizes the nature of distributional semantics in the similarity calculation. Finally, hierarchical agglomerative clustering (HAC) is used to produce the final clusters. Standard text mining datasets are used to measure the effectiveness of this approach. The evaluation is based on the purity of the clusters, and the proposed approach achieved far better clustering results than the conventional approach.
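The last two stages named in the abstract, HAC and purity-based evaluation, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distributional criteria or the novel similarity measure, so plain cosine similarity over term-weight dictionaries is used here as a stand-in, average linkage is an assumed linkage choice, and the function names (`cosine`, `hac`, `purity`) and the toy documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts.

    Assumption: a stand-in for the paper's similarity measure,
    which the abstract does not define.
    """
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac(docs, k):
    """Average-link agglomerative clustering, merging until k clusters remain.

    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        best = None  # (average similarity, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(docs[a], docs[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    majority = sum(Counter(labels[d] for d in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / len(labels)


if __name__ == "__main__":
    # Hypothetical toy collection: two sports and two finance documents.
    docs = [
        {"ball": 1.0, "goal": 1.0},
        {"goal": 1.0, "team": 1.0},
        {"stock": 1.0, "market": 1.0},
        {"market": 1.0, "bank": 1.0},
    ]
    labels = ["sport", "sport", "finance", "finance"]
    clusters = hac(docs, 2)
    print(clusters, purity(clusters, labels))
```

Purity ranges from 0 to 1 and equals 1 only when every cluster contains documents of a single class, which is why it suits a comparison against a conventional baseline on labeled datasets.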
Pages: 14-18
Number of pages: 5
Related papers
50 in total
  • [41] Time and Space Efficient Web Document Clustering Using Rayleigh Distribution
    D. Srikanth
    S. Sakthivel
    Wireless Personal Communications, 2018, 102 : 3255 - 3268
  • [42] Document Clustering Using Gravitational Ensemble Clustering
    Sadeghian, Armindokht Hashempour
    Nezamabadi-pour, Hossein
    2015 INTERNATIONAL SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND SIGNAL PROCESSING (AISP), 2015, : 240 - 245
  • [43] Using clustering for document reconstruction
    Ukovich, Anna
    Zacchigna, Alessandra
    Ramponi, Giovanni
    Schoier, Gabriella
    IMAGE PROCESSING: ALGORITHMS AND SYSTEMS, NEURAL NETWORKS, AND MACHINE LEARNING, 2006, 6064
  • [44] An Improved Genetic Algorithm for Document Clustering with Semantic Similarity Measure
    Song, Wei
    Park, Soon Cheol
    ICNC 2008: FOURTH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 1, PROCEEDINGS, 2008, : 536 - 540
  • [45] Leveraging Structural and Semantic Measures for JSON Document Clustering
    Priya, D. Uma
    Thilagam, P. Santhi
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2023, 29 (03) : 222 - 241
  • [46] Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity
    Zhu, Shanfeng
    Zeng, Jia
    Mamitsuka, Hiroshi
    BIOINFORMATICS, 2009, 25 (15) : 1944 - 1951
  • [47] Exploiting noun phrases and semantic relationships for text document clustering
    Zheng, Hai-Tao
    Kang, Bo-Yeong
    Kim, Hong-Gee
    INFORMATION SCIENCES, 2009, 179 (13) : 2249 - 2262
  • [48] A Semantic-based Feature Extraction Method Using Categorical Clustering for Persian Document Classification
    Davoudi, Saeedeh
    Mirzaei, Sayeh
    2021 26TH INTERNATIONAL COMPUTER CONFERENCE, COMPUTER SOCIETY OF IRAN (CSICC), 2021,
  • [49] Web document clustering using Document Index Graph
    Momin, B. F.
    Kulkarni, P. J.
    Chaudhari, Amol
    2006 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND COMMUNICATIONS, VOLS 1 AND 2, 2007, : 30 - 35
  • [50] Learning the Distribution Preserving Semantic Subspace for Clustering
    Tian, Jinyu
    Zhang, Taiping
    Qin, Anyong
    Shang, Zhaowei
    Tang, Yuan Yan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (12) : 5950 - 5965