A semi-supervised document clustering algorithm based on EM

Cited by: 9
Authors
Rigutini, L [1]
Maggini, M [1]
Affiliation
[1] Univ Siena, Dipartimento Ingn Informaz, I-53100 Siena, Italy
Keywords
semi-supervised document clustering; EM; information gain;
DOI
10.1109/WI.2005.13
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Document clustering is a very hard task in Automatic Text Processing since it requires extracting regular patterns from a document collection without a priori knowledge of the category structure. This task can also be difficult for humans because many different but valid partitions may exist for the same collection. Moreover, the lack of information about categories makes it difficult to apply effective feature selection techniques to reduce the noise in the representation of texts. Despite these intrinsic difficulties, text clustering is an important task for Web search applications in which huge collections or quite long query result lists must be automatically organized. Semi-supervised clustering lies in between automatic categorization and auto-organization. It is assumed that the supervisor is not required to specify a set of classes, but only to provide a set of texts grouped by the criteria to be used to organize the collection. In this paper we present a novel algorithm for clustering text documents which exploits the EM algorithm together with a feature selection technique based on Information Gain. The experimental results show that only very few documents are needed to initialize the clusters and that the algorithm is able to properly extract the regularities hidden in a huge unlabeled collection.
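The abstract gives only the outline of the method (a few grouped seed texts, EM, and Information Gain feature selection), so the following is a minimal illustrative sketch and not the authors' implementation. It assumes a multinomial naive Bayes mixture with Laplace smoothing, seed documents clamped to their clusters, and features re-selected at each iteration from the current hard assignments. All function names, parameters, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of cluster labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of a term's presence/absence with respect to the labels."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - len(present) / n * entropy(present)
            - len(absent) / n * entropy(absent))

def top_features(docs, labels, vocab, m):
    """Keep the m terms with the highest IG (ties broken alphabetically)."""
    return sorted(vocab, key=lambda t: -information_gain(docs, labels, t))[:m]

def m_step(docs, resp, features, k):
    """Re-estimate cluster priors and term probabilities (Laplace-smoothed)."""
    total = sum(sum(r) for r in resp)
    priors = [(1 + sum(r[c] for r in resp)) / (k + total) for c in range(k)]
    cond = []
    for c in range(k):
        counts = {t: 1.0 for t in features}
        for d, r in zip(docs, resp):
            for t in features:
                counts[t] += r[c] * d.get(t, 0)
        norm = sum(counts.values())
        cond.append({t: v / norm for t, v in counts.items()})
    return priors, cond

def e_step(doc, priors, cond, features):
    """Posterior cluster probabilities for one document (log-space)."""
    logp = [math.log(p) + sum(doc.get(t, 0) * math.log(cond[c][t]) for t in features)
            for c, p in enumerate(priors)]
    top = max(logp)
    w = [math.exp(x - top) for x in logp]
    return [x / sum(w) for x in w]

def semi_supervised_em(docs, seeds, k, n_features=4, n_iter=10):
    """docs: list of Counter; seeds: {doc index: cluster id} (assumed format)."""
    vocab = sorted({t for d in docs for t in d})
    # Only the seed documents carry responsibility at the start.
    resp = [[0.0] * k for _ in docs]
    for i, c in seeds.items():
        resp[i][c] = 1.0
    seed_idx = sorted(seeds)
    features = top_features([docs[i] for i in seed_idx],
                            [seeds[i] for i in seed_idx], vocab, n_features)
    labels = []
    for _ in range(n_iter):
        priors, cond = m_step(docs, resp, features, k)
        resp = [e_step(d, priors, cond, features) for d in docs]
        for i, c in seeds.items():  # clamp the seeds to their given cluster
            resp[i] = [1.0 if j == c else 0.0 for j in range(k)]
        labels = [max(range(k), key=r.__getitem__) for r in resp]
        features = top_features(docs, labels, vocab, n_features)
    return labels, features

# Toy corpus (invented for illustration): documents 0 and 1 are the seeds.
texts = [
    "goal match team goal",
    "vote election party vote",
    "team match score",
    "goal team win",
    "party vote law",
    "election law vote",
    "match goal score the",
    "party election the",
]
docs = [Counter(t.split()) for t in texts]
labels, features = semi_supervised_em(docs, seeds={0: 0, 1: 1}, k=2)
print(labels, features)
```

Re-selecting features from the current assignments is one plausible way to combine the two ingredients named in the abstract; the paper itself may schedule feature selection differently.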
Pages: 200-206
Number of pages: 7