Self organization of a massive document collection

Cited by: 521
Authors
Kohonen, T [1]
Kaski, S [1]
Lagus, K [1]
Salojärvi, J [1]
Honkela, J [1]
Paatero, V [1]
Saarela, A [1]
Affiliations
[1] Aalto Univ, Neural Networks Res Ctr, FIN-02150 Espoo, Finland
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2000 / Vol. 11 / No. 3
Funding
Academy of Finland;
Keywords
data mining; exploratory data analysis; knowledge discovery; large databases; parallel implementation; random projection; self-organizing map (SOM); textual documents;
DOI
10.1109/72.846729
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article describes the implementation of a system that is able to organize vast document collections according to textual similarities. It is based on the self-organizing map (SOM) algorithm. Statistical representations of the documents' vocabularies are used as the feature vectors. The main goal in our work has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. As the feature vectors we used 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
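The feature construction named in the abstract (random projections of weighted word histograms, followed by a lookup of the closest map node) can be illustrated with a minimal sketch. Only the 500-dimensional target size comes from the abstract; the Gaussian random directions, the per-word weight dictionary, the unit-length normalization, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter


def random_word_vector(word, dim, seed=0):
    """Fixed pseudo-random Gaussian direction for a word (hypothetical helper)."""
    rng = random.Random(f"{seed}:{word}")  # same word always gets the same vector
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def project_document(text, weights, dim=500, seed=0):
    """Project a weighted word histogram onto `dim` random directions."""
    histogram = Counter(text.lower().split())
    vec = [0.0] * dim
    for word, count in histogram.items():
        w = count * weights.get(word, 1.0)  # assumed weighting: count times word weight
        r = random_word_vector(word, dim, seed)
        for i in range(dim):
            vec[i] += w * r[i]
    # Assumed normalization to unit length; an empty document stays the zero vector.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def best_matching_unit(vec, codebook):
    """Index of the map node whose model vector is closest in Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
    )
```

The projection never builds the full vocabulary-sized histogram as a dense vector: each word contributes its own random direction, so the cost grows with the number of distinct words in the document rather than with the vocabulary size. An exhaustive nearest-node search as in `best_matching_unit` would be far too slow for a map with about a million nodes, which is the scaling problem the abstract says the work addresses.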
Pages: 574-585
Page count: 12
Related papers
50 in total
  • [1] Self organization of a massive text document collection
    Kohonen, T
    Kaski, S
    Lagus, K
    Salojärvi, J
    Honkela, J
    Paatero, V
    Saarela, A
    KOHONEN MAPS, 1999, : 171 - 182
  • [2] Self-organization of distributed document archives
    Merkl, Dieter
    Rauber, Andreas
    Proceedings of the International Database Engineering and Applications Symposium, IDEAS, 1999, : 128 - 136
  • [3] Self-organizing maps of massive document collections
    Kohonen, T
    IJCNN 2000: PROCEEDINGS OF THE IEEE-INNS-ENNS INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOL II, 2000, : 3 - 9
  • [4] CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
    El-Kishky, Ahmed
    Chaudhary, Vishrav
    Guzman, Francisco
    Koehn, Philipp
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 5960 - 5969
  • [5] Improving self-organization of document collections by semantic mapping
    Correa, Renato Fernandes
    Ludermir, Teresa Bernarda
    NEUROCOMPUTING, 2006, 70 (1-3) : 62 - 69
  • [6] Hybrid neural document clustering using guided self-organization and wordnet
    Hung, CL
    Wermter, S
    Smith, P
    IEEE INTELLIGENT SYSTEMS, 2004, 19 (02) : 68 - 77
  • [7] VANDYCK COLLECTION - A DOCUMENT REDISCOVERED
    BROWN, C
    RACAR-REVUE D ART CANADIENNE-CANADIAN ART REVIEW, 1983, 10 (01) : 69 - 72
  • [8] The Holocaust: An Encyclopedia and Document Collection
    Wiebe, Todd J.
    REFERENCE & USER SERVICES QUARTERLY, 2019, 59 (01) : 85 - 85
  • [9] Finding hotspots in document collection
    Peng, Wei
    Ding, Chris
    Li, Tao
    Sun, Tong
    19TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, VOL I, PROCEEDINGS, 2007, : 313 - +
  • [10] Collection-Document Summaries
    Witt, Nils
    Granitzer, Michael
    Seifert, Christin
    ADVANCES IN INFORMATION RETRIEVAL (ECIR 2018), 2018, 10772 : 638 - 643