A single-link method algorithm for clustering large document collections

Author
Kishida, K [1]
Affiliation
[1] Surugadai Univ, Hanno, Saitama, Japan
DOI
Not available
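Chinese Library Classification number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501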
Abstract
In the 1960s and 1970s, techniques for clustering a set of documents in order to improve the effectiveness or efficiency of information retrieval systems were widely explored. Many researchers have recently made similar attempts, aiming to visualise search results, to provide browsing-based search modes, or to enhance performance when searching very large collections. The purpose of this paper is to develop an algorithm for hierarchical clustering that can work for very large document collections. The algorithm combines two ideas proposed by other researchers to save time and space in hierarchical clustering: (1) the use of an inverted file to reduce the number of document pairs for which a similarity degree is calculated, and (2) a procedure for constructing a dendrogram based on the single-link method from similarity data recorded on disk rather than in main memory. In this paper, the algorithm is experimentally applied to a document set consisting of about 10,000 bibliographic records, and the processing time is analyzed empirically. In addition, the effects of removing words that appear frequently in documents are examined. As a result, we find that removing such words enables us to greatly reduce the processing time without significant change in the resulting set of clusters. Finally, an empirical comparison between the single-link method and the single-pass algorithm (leader-follower algorithm) is attempted.
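The two ideas combined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: everything is held in memory rather than on disk, and cosine similarity over binary term vectors is assumed purely for illustration. The function names and the `max_df` cutoff (a stand-in for removing frequently appearing words) are hypothetical. The single-link step processes candidate pairs in descending order of similarity and merges clusters with a union-find structure, which is one common way to obtain single-link merges.

```python
from collections import defaultdict
from itertools import combinations
import math


def candidate_similarities(docs, max_df=None):
    """Score only document pairs that share at least one retained term.

    docs: list of token lists. max_df (hypothetical parameter): terms that
    occur in more than max_df documents are dropped before scoring.
    """
    # Inverted file: term -> set of ids of documents containing the term.
    inverted = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            inverted[term].add(doc_id)

    shared = defaultdict(int)   # (a, b) -> number of shared retained terms
    kept = [0] * len(docs)      # number of retained terms per document
    for postings in inverted.values():
        if max_df is not None and len(postings) > max_df:
            continue            # skip a high-frequency term
        for d in postings:
            kept[d] += 1
        for a, b in combinations(sorted(postings), 2):
            shared[(a, b)] += 1

    # Cosine similarity for binary vectors (assumed measure).
    return {(a, b): n / math.sqrt(kept[a] * kept[b])
            for (a, b), n in shared.items()}


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        return True


def single_link_merges(n_docs, sims):
    """Return single-link merges as (similarity, doc_a, doc_b) tuples."""
    uf = UnionFind(n_docs)
    merges = []
    for (a, b), s in sorted(sims.items(), key=lambda kv: -kv[1]):
        if uf.union(a, b):      # the pair joins two different clusters
            merges.append((s, a, b))
    return merges


def clusters_at(n_docs, merges, threshold):
    """Cut the merge sequence at a similarity threshold."""
    uf = UnionFind(n_docs)
    for s, a, b in merges:
        if s >= threshold:
            uf.union(a, b)
    groups = defaultdict(list)
    for d in range(n_docs):
        groups[uf.find(d)].append(d)
    return sorted(groups.values())
```

With five short documents, only pairs that share a term are ever scored, so documents with no retained term in common cost nothing. Lowering `max_df` shrinks the set of scored pairs further, at the risk of changing the clusters.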
Pages: 27-38
Page count: 12