A semi-supervised document clustering algorithm based on EM

Cited by: 9
Authors
Rigutini, L [1]
Maggini, M [1]
Affiliation
[1] Univ Siena, Dipartimento Ingn Informaz, I-53100 Siena, Italy
Keywords
semi-supervised document clustering; EM; information gain
DOI
10.1109/WI.2005.13
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Document clustering is a very hard task in Automatic Text Processing, since it requires extracting regular patterns from a document collection without a priori knowledge of the category structure. The task can be difficult even for humans, because many different but valid partitions may exist for the same collection. Moreover, the lack of information about categories makes it difficult to apply effective feature selection techniques to reduce the noise in the representation of texts. Despite these intrinsic difficulties, text clustering is an important task for Web search applications, in which huge collections or quite long query result lists must be organized automatically. Semi-supervised clustering lies in between automatic categorization and auto-organization. The supervisor is not required to specify a set of classes, but only to provide a set of texts grouped by the criteria to be used to organize the collection. In this paper we present a novel algorithm for clustering text documents that combines the EM algorithm with a feature selection technique based on Information Gain. The experimental results show that only very few documents are needed to initialize the clusters and that the algorithm is able to properly extract the regularities hidden in a huge unlabeled collection.
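The abstract does not spell out the algorithm, so the following is only a minimal sketch of how EM and Information Gain feature selection might be combined when a few seed documents are available. It is not the authors' implementation. Everything below is an assumption made for illustration: the multinomial naive Bayes mixture, the Laplace smoothing parameter `alpha`, re-selecting the top `n_keep` terms at every iteration, clamping the seed documents to their given groups, and the function names `information_gain` and `semi_supervised_em`.

```python
import numpy as np

def information_gain(X, resp):
    """Information gain of each term's presence w.r.t. (soft) cluster membership.

    X: (n_docs, n_terms) term-count matrix.
    resp: (n_docs, k) soft memberships; rows of zeros are ignored.
    """
    eps = 1e-12
    present = (X > 0).astype(float)            # binary term occurrence
    w = resp.sum()                             # total membership mass
    p_c = resp.sum(axis=0) / w                 # P(c)
    joint = present.T @ resp / w               # P(term present, c)
    p_t = joint.sum(axis=1)                    # P(term present)
    joint_abs = np.clip(p_c[None, :] - joint, 0.0, None)  # P(term absent, c)

    def entropy(p):
        return -(p * np.log(p + eps)).sum(axis=-1)

    h_present = entropy(joint / (p_t[:, None] + eps))
    h_absent = entropy(joint_abs / (1.0 - p_t[:, None] + eps))
    return entropy(p_c) - p_t * h_present - (1.0 - p_t) * h_absent

def semi_supervised_em(X, seed_labels, k, n_keep, n_iter=10, alpha=1.0):
    """Hypothetical seeded EM. seed_labels holds a cluster index for seed
    documents and -1 for unlabeled ones."""
    n = X.shape[0]
    seeded = seed_labels >= 0
    seed_resp = np.eye(k)[seed_labels[seeded]]
    resp = np.zeros((n, k))                    # unlabeled docs start with no mass
    resp[seeded] = seed_resp
    for _ in range(n_iter):
        # Feature selection: keep the n_keep terms with the highest IG
        # under the current (soft) assignment.
        features = np.argsort(information_gain(X, resp))[-n_keep:]
        Xs = X[:, features]
        # M-step: smoothed cluster priors and per-cluster term distributions.
        prior = (resp.sum(axis=0) + alpha) / (resp.sum() + k * alpha)
        counts = resp.T @ Xs + alpha
        log_theta = np.log(counts / counts.sum(axis=1, keepdims=True))
        # E-step: posterior membership of every document.
        log_post = np.log(prior) + Xs @ log_theta.T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        resp[seeded] = seed_resp               # clamp the supervised seeds
    return resp.argmax(axis=1), features

# Toy collection: terms 0-1 mark one topic, terms 2-3 another, terms 4-5 are noise.
X = np.array([[3, 2, 0, 0, 1, 1],
              [0, 0, 2, 3, 1, 1],
              [2, 3, 0, 0, 1, 0],
              [4, 1, 0, 0, 0, 1],
              [0, 0, 3, 2, 1, 1],
              [0, 0, 1, 4, 1, 0]])
seeds = np.array([0, 1, -1, -1, -1, -1])       # one seed document per group
labels, kept = semi_supervised_em(X, seeds, k=2, n_keep=4)
```

In this sketch the first iteration estimates both the selected terms and the cluster models from the seed documents alone; later iterations use the soft assignments of the whole collection, which is one plausible reading of how a few seeds can initialize the clusters.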
Pages: 200-206
Page count: 7