Fast categorisation of large document collections

Cited by: 2
Authors
Shanks, V [1]
Williams, HE [1]
Affiliations
[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia
Keywords
document management; categorisation; feature extraction; efficiency
DOI
10.1109/SPIRE.2001.989757
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As the volume of data stored online increases, careful management of large document collections becomes increasingly important. Categorisation is one important document management technique. It has been effectively employed on the Web, where links to documents are maintained in topic or interest areas in, for example, the manually-categorised Yahoo! hierarchy. The drawbacks of manual categorisation are that it is practical only for small numbers of documents, does not scale, and relies on the subjective judgement of human assessors. Automatic categorisation has been shown to be an accurate alternative to manual categorisation. In automatic categorisation, documents are processed and automatically assigned to pre-defined categories that represent an interest or topic area. We propose and investigate heuristics for fast categorisation of large collections of documents that are focused on selecting a minimal set of representative features from uncategorised documents. We show that these new heuristics are accurate (in some cases more accurate than the baseline techniques) and also permit more than threefold reductions in processing time for categorising large collections.
Pages: 194-204
Page count: 11