A vector reconstruction based clustering algorithm particularly for large-scale text collection

Cited by: 2
Authors
Liu, Ming [1, 2]
Wu, Chong [1]
Chen, Lei [3]
Affiliations
[1] Sch Management, Harbin, Peoples R China
[2] Sch Comp Sci & Technol, Harbin, Peoples R China
[3] Beijing Normal Univ, Int Business Fac, Zhuhai, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Vector reconstruction; Large-scale text clustering; Partial tuning sub-process; Overall tuning sub-process; SELF-ORGANIZING MAPS; MUTUAL INFORMATION; WEIGHT; SELECTION; ENTROPY;
DOI
10.1016/j.neunet.2014.10.012
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rapid development of internet technology, internet users face large amounts of textual data every day. Organizing texts into categories can help users extract useful information from large-scale text collections. Clustering is one of the most promising tools for categorizing texts because it is unsupervised. Unfortunately, most traditional clustering algorithms perform poorly on large-scale text collections, mainly because of the high-dimensional vector space and the semantic similarity among texts. To cluster large-scale text collections effectively and efficiently, this paper proposes a vector reconstruction based clustering algorithm. Only the features that can represent a cluster are preserved in the cluster's representative vector. The algorithm alternates between two sub-processes until it converges. The first is the partial tuning sub-process, in which feature weights are fine-tuned by an iterative process similar to the self-organizing map (SOM) algorithm. To speed up clustering, an intersection based similarity measurement and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The second is the overall tuning sub-process, in which features are reallocated among different clusters. In this sub-process, features that are not useful for representing a cluster are removed from the cluster's representative vector. Experimental results on three text collections (two small-scale and one large-scale) demonstrate that the algorithm achieves high-quality results on both small-scale and large-scale text collections. (C) 2014 Elsevier Ltd. All rights reserved.
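Illustrative sketch (not the authors' implementation): the abstract describes an alternation between a SOM-like partial tuning step and a feature-reallocating overall tuning step, but gives no formulas. The Python below is one possible reading under stated assumptions. The similarity form (sum of minimum weights over shared features), the learning rate `lr`, the rule "a feature stays only in the cluster where its mean member weight is largest", and every function name are hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse vector: feature -> weight


def intersection_similarity(doc: Vector, rep: Vector) -> float:
    """Score computed only over shared features (assumed form: sum of minima)."""
    return sum(min(doc[f], rep[f]) for f in doc.keys() & rep.keys())


def partial_tuning(docs: List[Vector], reps: List[Vector], lr: float = 0.5) -> List[int]:
    """SOM-like pass: assign each document to its best cluster and nudge
    the winner's weights toward the document on shared features only."""
    labels = []
    for doc in docs:
        best = max(range(len(reps)),
                   key=lambda k: intersection_similarity(doc, reps[k]))
        rep = reps[best]
        for f in doc.keys() & rep.keys():
            rep[f] += lr * (doc[f] - rep[f])
        labels.append(best)
    return labels


def overall_tuning(docs: List[Vector], labels: List[int],
                   reps: List[Vector]) -> List[Vector]:
    """Reallocate features: a feature stays only in the cluster(s) where its
    mean weight over member documents is largest (assumed rule)."""
    sizes = [labels.count(k) for k in range(len(reps))]
    means: List[Vector] = [dict() for _ in reps]
    for doc, k in zip(docs, labels):
        for f, w in doc.items():
            means[k][f] = means[k].get(f, 0.0) + w / sizes[k]
    new_reps = []
    for k, rep in enumerate(reps):
        if sizes[k] == 0:  # empty cluster: leave its vector unchanged
            new_reps.append(dict(rep))
            continue
        kept = {}
        for f, m in means[k].items():
            if all(m >= means[j].get(f, 0.0) for j in range(len(reps))):
                kept[f] = rep.get(f, m)  # keep the tuned weight if present
        new_reps.append(kept)
    return new_reps


def cluster(docs: List[Vector], seeds: List[Vector],
            max_iter: int = 20) -> Tuple[List[int], List[Vector]]:
    """Alternate the two sub-processes until the assignment stops changing."""
    reps = [dict(s) for s in seeds]  # copy so the caller's seeds are untouched
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = partial_tuning(docs, reps, lr=0.5)
        reps = overall_tuning(docs, new_labels, reps)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, reps
```

Restricting both the similarity and the weight update to shared features keeps each step proportional to the number of overlapping features rather than the full vocabulary size, which is one plausible way to read the abstract's claim that the intersection based measurement accelerates clustering.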
Pages: 141-155
Page count: 15
Related papers
50 in total
  • [1] A Novel Clustering Algorithm and Its Incremental Version for Large-Scale Text Collection
    Chen, Lei
    Liu, Ming
    Wu, Chong
    Xu, Ai
    [J]. INFORMATION TECHNOLOGY AND CONTROL, 2016, 45 (02): : 136 - 147
  • [2] A dynamic SOM algorithm for clustering large-scale document collection
    Luo, Kegang
    Liu, Yuanchao
    Wang, Xiaolong
    [J]. ALPIT 2007: PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON ADVANCED LANGUAGE PROCESSING AND WEB INFORMATION TECHNOLOGY, 2007, : 15 - +
  • [3] Efficient adaptive large-scale text clustering method based on genetic K-means algorithm
    Dai, Wenhua
    Jiao, Cuizhen
    He, Tingting
    [J]. RECENT ADVANCE OF CHINESE COMPUTING TECHNOLOGIES, 2007, : 281 - 285
  • [4] Genetic Algorithm Based Clustering for Large-Scale Sensor Networks
    Lin, Hai
    Kong, Ruoshan
    Liu, Jiali
    [J]. CYBERNETICS AND INFORMATION TECHNOLOGIES, 2015, 15 (06) : 168 - 177
  • [5] A stratified sampling based clustering algorithm for large-scale data
    Zhao, Xingwang
    Liang, Jiye
    Dang, Chuangyin
    [J]. KNOWLEDGE-BASED SYSTEMS, 2019, 163 : 416 - 428
  • [6] A Sampling-Based Graph Clustering Algorithm for Large-Scale Networks
    Zhang, Jian-Peng
    Chen, Hong-Chang
    Wang, Kai
    Zhu, Kai-Jie
    Wang, Ya-Wen
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2019, 47 (08): : 1731 - 1737
  • [7] CLUSTERING LARGE-SCALE DATA BASED ON MODIFIED AFFINITY PROPAGATION ALGORITHM
    Serdah, Ahmed M.
    Ashour, Wesam M.
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING RESEARCH, 2016, 6 (01) : 23 - 33
  • [8] Algorithm for large-scale clustering across multiple genomes
    Yi, Gangman
    Jung, Jaehee
    [J]. BIOINFORMATION, 2011, 7 (05) : 251 - 255
  • [9] An optimizing clustering algorithm for large-scale mobile network
    Tian, YC
    Guoi, W
    Ren, QC
    [J]. 2002 INTERNATIONAL CONFERENCE ON COMMUNICATIONS, CIRCUITS AND SYSTEMS AND WEST SINO EXPOSITION PROCEEDINGS, VOLS 1-4, 2002, : 155 - 159
  • [10] A distributed and incremental algorithm for large-scale graph clustering
    Inoubli, Wissem
    Aridhi, Sabeur
    Mezni, Haithem
    Maddouri, Mondher
    Nguifo, Engelbert Mephu
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2022, 134 : 334 - 347