A vector reconstruction based clustering algorithm particularly for large-scale text collection

Cited by: 2
Authors
Liu, Ming [1, 2]
Wu, Chong [1]
Chen, Lei [3]
Affiliations
[1] Sch Management, Harbin, Peoples R China
[2] Sch Comp Sci & Technol, Harbin, Peoples R China
[3] Beijing Normal Univ, Int Business Fac, Zhuhai, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Vector reconstruction; Large-scale text clustering; Partial tuning sub-process; Overall tuning sub-process; SELF-ORGANIZING MAPS; MUTUAL INFORMATION; WEIGHT; SELECTION; ENTROPY;
DOI
10.1016/j.neunet.2014.10.012
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid evolution of internet technology, internet users face a large amount of textual data every day. Organizing texts into categories can help users extract useful information from a large-scale text collection. Clustering is one of the most promising tools for categorizing texts because it is unsupervised. Unfortunately, most traditional clustering algorithms lose quality on large-scale text collections, mainly because of the high-dimensional vector space and the semantic similarity among texts. To cluster large-scale text collections effectively and efficiently, this paper puts forward a vector reconstruction based clustering algorithm. Only the features that can represent a cluster are preserved in the cluster's representative vector. The algorithm alternates between two sub-processes until it converges. The first is the partial tuning sub-process, in which feature weights are fine-tuned by an iterative process similar to the self-organizing map (SOM) algorithm. To accelerate clustering, an intersection based similarity measurement and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The second is the overall tuning sub-process, in which features are reallocated among different clusters and the features that are useless for representing a cluster are removed from its representative vector. Experimental results on three text collections (two small-scale and one large-scale) demonstrate that the algorithm achieves high-quality clustering on both small-scale and large-scale text collections. (C) 2014 Elsevier Ltd. All rights reserved.
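The alternating structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the dot product over shared features, the learning rate `lr`, the top-`keep` pruning rule, and every function name below are assumptions chosen only to show the shape of the two sub-processes.

```python
from collections import defaultdict

def intersection_similarity(doc, rep):
    """Assumed measure: dot product over features present in both sparse vectors."""
    small, large = (doc, rep) if len(doc) <= len(rep) else (rep, doc)
    return sum(w * large[t] for t, w in small.items() if t in large)

def partial_tuning(docs, reps, lr=0.5):
    """SOM-like step (assumed form): pick the winning cluster for each document
    and pull the winner's shared feature weights toward the document."""
    assignment = []
    for doc in docs:
        winner = max(range(len(reps)),
                     key=lambda k: intersection_similarity(doc, reps[k]))
        for t in doc.keys() & reps[winner].keys():
            reps[winner][t] += lr * (doc[t] - reps[winner][t])
        assignment.append(winner)
    return assignment

def overall_tuning(docs, assignment, reps, keep=2):
    """Assumed reallocation rule: rebuild each representative as the mean of its
    members, then drop all but the `keep` heaviest features."""
    for k in range(len(reps)):
        members = [d for d, a in zip(docs, assignment) if a == k]
        if not members:
            continue  # leave an empty cluster's representative unchanged
        totals = defaultdict(float)
        for d in members:
            for t, w in d.items():
                totals[t] += w / len(members)
        top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]
        reps[k] = dict(top)

def cluster(docs, reps, max_rounds=20):
    """Alternate the two sub-processes until the assignment stops changing."""
    previous = None
    for _ in range(max_rounds):
        assignment = partial_tuning(docs, reps)
        overall_tuning(docs, assignment, reps)
        if assignment == previous:
            break
        previous = assignment
    return assignment, reps

# Toy usage with hand-made sparse term-weight vectors and seed representatives.
docs = [
    {"ball": 1.0, "goal": 1.0},
    {"ball": 1.0, "team": 0.5},
    {"stock": 1.0, "bank": 1.0},
    {"stock": 1.0, "market": 0.5},
]
assignment, reps = cluster(docs, [{"ball": 1.0}, {"stock": 1.0}])
```

Restricting both the similarity and the weight update to shared features means the cost per comparison depends on the size of the smaller sparse vector rather than on the full vocabulary, which is one plausible reading of how an intersection based measure could speed up clustering.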
Pages: 141-155
Number of pages: 15