Exploiting Document Level Semantics in Document Clustering

Cited by: 0

Authors:

Rafi, Muhammad [1]

Sharif, Muhammad Naveed

Arshad, Waleed

Rafay, Habibullah

Mohsin, Sheharyar

Shaikh, Mohammad Shahid [2]

Affiliations:

[1] FAST NUCES, Dept Comp Sci, Karachi, Pakistan

[2] Habib Univ, Fac Elect Engn, Karachi, Pakistan

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2016 / Vol. 7 / No. 6

Keywords:

Document Clustering; Text Mining; Similarity Measure; Semantics

DOI:

Not available

Chinese Library Classification number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering methods work by extracting textual features such as terms, sequences, and phrases from documents. These features are treated as independent of each other and do not capture the meaning behind the words during clustering. To perform semantically viable clustering, we believe that the document clustering problem has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help reduce the dimensionality of the document, and (2) defining a similarity measure based on lexical, syntactic, and semantic features that assigns higher numerical values to document pairs with stronger syntactic and semantic relationships. In this paper, we propose a document representation that extracts three types of features from a given document: lexical (alpha), syntactic (beta), and semantic (gamma). A meta-descriptor for each document is built from these three features: first lexical, then syntactic, and finally semantic. A document-to-document similarity matrix is produced in which each entry contains a three-value vector, with one value each for the lexical alpha, syntactic beta, and semantic gamma components. The main contributions of this research are: (i) a document-level descriptor using three different text features: lexical, syntactic, and semantic; (ii) a similarity function using these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process toward semantically richer clusters. We performed an extensive series of experiments on standard text mining data sets with external clustering evaluation measures such as F-Measure and Purity, and obtained encouraging results.
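
The three-valued similarity matrix described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how the lexical, syntactic, and semantic features are computed, so everything concrete here is an assumption made for illustration only: word sets stand in for lexical features, adjacent word pairs for syntactic features, a tiny hypothetical `CONCEPTS` lookup for semantic features, and Jaccard overlap for each similarity component. The names `descriptor`, `similarity_matrix`, and `jaccard` are likewise hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical concept map standing in for a real semantic resource (assumption).
CONCEPTS = {"car": "vehicle", "automobile": "vehicle", "truck": "vehicle"}


def descriptor(text):
    """Return (lexical, syntactic, semantic) feature sets for one document."""
    tokens = text.lower().split()
    lexical = set(tokens)                              # plain word set
    syntactic = set(zip(tokens, tokens[1:]))           # adjacent word pairs, a crude proxy
    semantic = {CONCEPTS.get(t, t) for t in tokens}    # words mapped to concepts
    return lexical, syntactic, semantic


def similarity_matrix(docs):
    """Build an n x n matrix whose entries are (alpha, beta, gamma) tuples."""
    descs = [descriptor(d) for d in docs]
    n = len(docs)
    # Start with self-similarity everywhere; off-diagonal cells are overwritten below.
    matrix = [[(1.0, 1.0, 1.0)] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        vec = tuple(jaccard(x, y) for x, y in zip(descs[i], descs[j]))
        matrix[i][j] = matrix[j][i] = vec
    return matrix
```

Under these assumptions, `similarity_matrix(["the car stopped", "the automobile stopped"])[0][1]` evaluates to `(0.5, 0.0, 1.0)`: the two documents share half of their words and no adjacent word pairs, yet map to the same concept set. This is the kind of pair that a purely lexical measure would underrate.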

Pages: 462-469

Number of pages: 8
