Self organization of a massive document collection

Cited by: 521
Authors
Kohonen, T [1]
Kaski, S [1]
Lagus, K [1]
Salojärvi, J [1]
Honkela, J [1]
Paatero, V [1]
Saarela, A [1]
Affiliation
[1] Aalto Univ, Neural Networks Res Ctr, FIN-02150 Espoo, Finland
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2000 / Vol. 11 / Issue 03
Funding
Academy of Finland;
Keywords
data mining; exploratory data analysis; knowledge discovery; large databases; parallel implementation; random projection; self-organizing map (SOM); textual documents;
DOI
10.1109/72.846729
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the self-organizing map (SOM) algorithm. Statistical representations of the documents' vocabularies are used as the feature vectors. The main goal in our work has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data. In a practical experiment we mapped 6 840 568 patent abstracts onto a 1 002 240-node SOM. As the feature vectors we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
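As a rough illustration of the two ideas named in the abstract, the sketch below projects a weighted word histogram to a lower dimension with a random matrix and then finds the closest SOM node for the result. It is only a toy under stated assumptions: the Gaussian random matrix, the function names (random_projection_matrix, project, best_matching_unit), the 8-word vocabulary, and the 3-dimensional target are all hypothetical choices for this example, and the abstract does not specify how the article's projection or word weights were constructed.

```python
import random


def random_projection_matrix(vocab_size, target_dim, seed=0):
    """Build a target_dim x vocab_size matrix of Gaussian random numbers.

    Assumption: Gaussian entries are one common choice for random
    projection; the abstract does not say which construction was used.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
            for _ in range(target_dim)]


def project(histogram, matrix):
    """Map a weighted word histogram to a lower-dimensional vector."""
    return [sum(r * h for r, h in zip(row, histogram)) for row in matrix]


def best_matching_unit(vector, nodes):
    """Return the index of the SOM node closest to vector (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(nodes)), key=lambda i: sq_dist(vector, nodes[i]))


# Toy data: a weighted histogram over a hypothetical 8-word vocabulary,
# reduced to 3 dimensions (the article reports 500 dimensions).
histogram = [0.0, 1.5, 0.0, 2.0, 0.0, 0.0, 0.7, 0.0]
matrix = random_projection_matrix(vocab_size=8, target_dim=3)
feature = project(histogram, matrix)

# Two hypothetical node vectors; the document is assigned to the nearer one.
nodes = [[0.0, 0.0, 0.0], feature]
winner = best_matching_unit(feature, nodes)
```

In this toy setup the second node is identical to the projected document, so it is selected as the winner. A full SOM would additionally update the winning node and its neighbours during training, which this sketch leaves out.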
Pages: 574-585
Page count: 12