A vector reconstruction based clustering algorithm particularly for large-scale text collection

Cited by: 2
Authors
Liu, Ming [1 ,2 ]
Wu, Chong [1 ]
Chen, Lei [3 ]
Affiliations
[1] Sch Management, Harbin, Peoples R China
[2] Sch Comp Sci & Technol, Harbin, Peoples R China
[3] Beijing Normal Univ, Int Business Fac, Zhuhai, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Vector reconstruction; Large-scale text clustering; Partial tuning sub-process; Overall tuning sub-process; SELF-ORGANIZING MAPS; MUTUAL INFORMATION; WEIGHT; SELECTION; ENTROPY;
DOI
10.1016/j.neunet.2014.10.012
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rapid development of internet technology, internet users face a large amount of textual data every day. Organizing texts into categories helps users extract useful information from a large-scale text collection. Clustering is one of the most promising tools for categorizing texts because it is unsupervised. Unfortunately, most traditional clustering algorithms lose quality on large-scale text collections, mainly because of the high-dimensional vector space and the semantic similarity among texts. To cluster large-scale text collections effectively and efficiently, this paper puts forward a vector reconstruction based clustering algorithm. Only the features that can represent a cluster are preserved in the cluster's representative vector. The algorithm alternates between two sub-processes until it converges. The first is the partial tuning sub-process, in which each feature's weight is fine-tuned by an iterative process similar to the self-organizing map (SOM) algorithm. To speed up clustering, an intersection based similarity measurement and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The second is the overall tuning sub-process, in which features are reallocated among clusters. In this sub-process, features that are useless for representing a cluster are removed from the cluster's representative vector. Experimental results on three text collections (two small-scale and one large-scale) demonstrate that the algorithm performs well on both small-scale and large-scale text collections. (C) 2014 Elsevier Ltd. All rights reserved.
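The two sub-processes described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the sum-of-minimums similarity, the learning rate, and the pruning threshold are hypothetical, and the pruning step only covers feature removal, not the reallocation of features among clusters. It is not the paper's actual method.

```python
from typing import Dict

# Sparse vector: feature name -> weight (representation assumed for illustration).
Vector = Dict[str, float]


def intersection_similarity(doc: Vector, rep: Vector) -> float:
    """Hypothetical intersection-based similarity.

    Only features present in both vectors contribute, so the cost scales
    with the size of the overlap instead of the full vocabulary.
    """
    shared = doc.keys() & rep.keys()
    return sum(min(doc[f], rep[f]) for f in shared)


def partial_tune(doc: Vector, rep: Vector, rate: float) -> None:
    """SOM-style update restricted to shared features (assumed form).

    Each shared weight in the representative vector moves toward the
    document's weight by the fraction `rate`.
    """
    for f in doc.keys() & rep.keys():
        rep[f] += rate * (doc[f] - rep[f])


def overall_tune(rep: Vector, threshold: float) -> None:
    """Remove features whose weight is below `threshold` (assumed criterion)."""
    for f in [f for f, w in rep.items() if w < threshold]:
        del rep[f]


doc = {"a": 1.0, "b": 0.5}
rep = {"a": 0.5, "c": 0.2}
sim = intersection_similarity(doc, rep)  # only "a" is shared
partial_tune(doc, rep, rate=0.5)         # "a" moves halfway toward 1.0
overall_tune(rep, threshold=0.3)         # "c" falls below the threshold
```

Restricting both the similarity and the update to shared features is what keeps the per-document cost low in this sketch; whether the paper uses the same restriction in its neuron adjustment function is not stated in the abstract.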
Pages: 141 - 155
Page count: 15
Related papers
50 in total
  • [32] Fast and scalable support vector clustering for large-scale data analysis
    Ping, Yuan
    Chang, Yun Feng
    Zhou, Yajian
    Tian, Ying Jie
    Yang, Yi Xian
    Zhang, Zhili
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2015, 43 (02) : 281 - 310
  • [33] Structure-based Clustering Algorithm for Model Reduction of Large-scale Network Systems
    Niazi, Muhammad Umar B.
    Chen, Xiaodong
    Canudas-de-Wit, Carlos
    Scherpen, Jacquelien M. A.
    [J]. 2019 IEEE 58TH CONFERENCE ON DECISION AND CONTROL (CDC), 2019, : 5038 - 5043
  • [34] A Spark-based Artificial Bee Colony Algorithm for Large-scale Data Clustering
    Wang, Yanjie
    Qian, Quan
    [J]. IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018, : 1213 - 1218
  • [35] CASS: A distributed network clustering algorithm based on structure similarity for large-scale network
    Kim, Jungrim
    Shin, Mincheol
    Kim, Jeongwoo
    Park, Chihyun
    Lee, Sujin
    Woo, Jaemin
    Kim, Hyerim
    Seo, Dongmin
    Yu, Seokjong
    Park, Sanghyun
    [J]. PLOS ONE, 2018, 13 (10):
  • [36] A virtual circle-based clustering algorithm with mobility prediction in large-scale MANETs
    Wang, GJ
    Zhang, LF
    Cao, JN
    [J]. NETWORKING AND MOBILE COMPUTING, PROCEEDINGS, 2005, 3619 : 364 - 374
  • [37] Large-scale distributed PV cluster division based on Fast Unfolding clustering algorithm
    Wang, Lei
    Zhang, Fan
    Kou, Lingfeng
    Xu, Yihu
    Hou, Xiaogang
    [J]. Taiyangneng Xuebao/Acta Energiae Solaris Sinica, 2021, 42 (10) : 29 - 34
  • [38] ACURDION: An Adaptive Clustering-based Algorithm for Tracing Large-scale MPI Applications
    Bahmani, Amir
    Mueller, Frank
    [J]. PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2015, : 785 - 792
  • [39] W-Hash: A Novel Word Hash Clustering Algorithm for Large-Scale Chinese Short Text Analysis
    Chen, Yaofeng
    Zhang, Chunyang
    Ye, Long
    Peng, Xiaogang
    Qiu, Meikang
    Cao, Weipeng
    [J]. KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, KSEM 2022, PT III, 2022, 13370 : 528 - 539
  • [40] Large-Scale Urban Reconstruction with Tensor Clustering and Global Boundary Refinement
    Poullis, Charalambos
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (05) : 1132 - 1145