Exploiting Document Level Semantics in Document Clustering

Cited by: 0
Authors
Rafi, Muhammad [1]
Sharif, Muhammad Naveed
Arshad, Waleed
Rafay, Habibullah
Mohsin, Sheharyar
Shaikh, Mohammad Shahid [2]
Affiliations
[1] FAST NUCES, Dept Comp Sci, Karachi, Pakistan
[2] Habib Univ, Fac Elect Engn, Karachi, Pakistan
Keywords
Document Clustering; Text Mining; Similarity Measure; Semantics
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory and Methods]
Discipline classification code
081202
Abstract
Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering methods extract textual features such as terms, sequences, and phrases from documents. These features are treated as independent of each other, and the meaning behind the words plays no role in the clustering process. To perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document representation; and (2) defining a similarity measure based on lexical, syntactic, and semantic features that assigns higher numerical values to document pairs with a stronger syntactic and semantic relationship. In this paper, we propose a document representation that extracts three types of features from a given document: lexical (alpha), syntactic (beta), and semantic (gamma) features. A meta-descriptor for each document is built from these three features, in the order lexical, syntactic, and semantic. A document-to-document similarity matrix is then produced in which each entry is a three-valued vector, with one value each for the lexical (alpha), syntactic (beta), and semantic (gamma) components. The main contributions of this research are: (i) a document-level descriptor using three different kinds of text features: lexical, syntactic, and semantic; (ii) a similarity function using these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process toward semantically richer clusters.
We performed an extensive series of experiments on standard text mining data sets, using external clustering evaluation measures such as F-Measure and Purity, and obtained encouraging results.
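The two external evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which variant of the clustering F-Measure was used, so the version below (a class-size-weighted best-match F-measure) is an assumption, as are the function names `purity` and `f_measure` and the toy labels. It is not the authors' evaluation code.

```python
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that fall in the majority class of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    total = len(labels_true)
    return sum(max(counts.values()) for counts in clusters.values()) / total


def f_measure(labels_true, labels_pred, beta=1.0):
    """Class-size-weighted best-match F-measure (assumed variant).

    For each true class, take the best F score over all clusters,
    then average those scores weighted by class size.
    """
    total = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    score = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n = overlap[(cls, clu)]
            if n == 0:
                continue
            precision = n / n_clu
            recall = n / n_cls
            f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
            best = max(best, f)
        score += (n_cls / total) * best
    return score


if __name__ == "__main__":
    # Hypothetical toy data: six documents, three true subjects, three clusters.
    truth = ["a", "a", "a", "b", "b", "c"]
    predicted = [0, 0, 1, 1, 1, 2]
    print("purity:", purity(truth, predicted))
    print("f-measure:", f_measure(truth, predicted))
```

Both measures compare a clustering against known subject labels and equal 1.0 when every cluster contains exactly one class and every class sits in exactly one cluster.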
Pages: 462-469
Page count: 8