Fast categorisation of large document collections

Cited by: 2
Authors
Shanks, V [1]
Williams, HE [1]
Affiliation
[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia
Keywords
document management; categorisation; feature extraction; efficiency
DOI
10.1109/SPIRE.2001.989757
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As the volume of data stored online increases, careful management of large document collections becomes increasingly important. Categorisation is one important document management technique. It has been employed effectively on the Web, where links to documents are maintained in topic or interest areas, for example in the manually categorised Yahoo! hierarchy. The drawback of manual categorisation is that it is practical only for small numbers of documents, does not scale, and relies on the subjective judgement of human assessors. Automatic categorisation has been shown to be an accurate alternative to manual categorisation. In automatic categorisation, documents are processed and automatically assigned to pre-defined categories that each represent an interest or topic area. We propose and investigate heuristics for fast categorisation of large document collections that focus on selecting a minimal set of representative features from uncategorised documents. We show that these new heuristics are accurate (in some cases more accurate than the baseline techniques) and also permit more than threefold reductions in processing time when categorising large collections.
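The abstract does not spell out the heuristics themselves, so the following is only an illustrative sketch of the general idea: reduce an uncategorised document to a small set of representative features, then assign it to the closest pre-defined category. The names `top_k_terms`, `cosine`, `categorise`, and `profiles`, the choice of "most frequent terms" as the selection rule, and the cosine-similarity classifier are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
import math


def top_k_terms(text, k=5):
    """Keep only the k most frequent terms of a document.

    Assumption: raw term frequency stands in for whatever feature
    selection heuristic the paper actually uses.
    """
    counts = Counter(text.lower().split())
    return dict(counts.most_common(k))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorise(text, category_profiles, k=5):
    """Assign a document to the most similar pre-defined category,
    comparing only its k selected features against each profile."""
    features = top_k_terms(text, k)
    return max(category_profiles,
               key=lambda c: cosine(features, category_profiles[c]))


# Hypothetical category profiles built from already-categorised text.
profiles = {
    "sport": Counter("football match goal team team score".split()),
    "finance": Counter("bank market stock market price share".split()),
}

print(categorise("the stock market fell as bank share prices dropped",
                 profiles, k=4))
```

Because only `k` features per document are compared against each category profile, the cost of categorising a document in this sketch depends on `k` rather than on the document's length, which is the kind of saving the abstract attributes to selecting a minimal feature set.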
Pages: 194-204
Page count: 11