Fast categorisation of large document collections

Cited by: 2

Authors:

Shanks, V [1]

Williams, HE [1]

Affiliation:

[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia

Source:

EIGHTH SYMPOSIUM ON STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2001

Keywords:

document management; categorisation; feature extraction; efficiency

DOI:

10.1109/SPIRE.2001.989757

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

As the volume of data stored online increases, careful management of large document collections becomes increasingly important. Categorisation is one important document management technique. It has been effectively employed on the Web, where links to documents are maintained in topic or interest areas, for example in the manually categorised Yahoo! hierarchy. The drawbacks of manual categorisation are that it is practical only for small numbers of documents, it is not scalable, and it relies on the subjective judgement of human assessors. Automatic categorisation has been shown to be an accurate alternative to manual categorisation. In automatic categorisation, documents are processed and automatically assigned to pre-defined categories that represent an interest or topic area. We propose and investigate heuristics for fast categorisation of large collections of documents that are focused on selecting a minimal set of representative features from uncategorised documents. We show that these new heuristics are accurate (in some cases more accurate than the baseline techniques) and also permit more than threefold reductions in processing time for categorising large collections.
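
The abstract does not spell out the heuristics themselves, so the following is only an illustrative sketch of the general idea: reduce each uncategorised document to a small set of features before comparing it with pre-defined category profiles. Everything here is an assumption made for illustration, not the authors' method: the "keep the k most frequent terms" rule, the cosine comparison, and the names `top_k_terms`, `cosine`, `categorise`, and `category_profiles` are all hypothetical.

```python
from collections import Counter
import math


def top_k_terms(text, k):
    """Reduce a document to its k most frequent terms.

    Hypothetical feature-selection rule, chosen only for illustration.
    """
    counts = Counter(text.lower().split())
    return dict(counts.most_common(k))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorise(text, category_profiles, k=5):
    """Assign the document to the category whose profile is most similar.

    Only k features are extracted from the document, so the comparison
    touches far fewer terms than a full-document representation would.
    """
    features = top_k_terms(text, k)
    return max(category_profiles,
               key=lambda name: cosine(features, category_profiles[name]))


# Toy category profiles (assumed data, not taken from the paper).
category_profiles = {
    "sport": {"football": 3, "goal": 2, "match": 1},
    "finance": {"bank": 3, "loan": 2, "interest": 1},
}

label = categorise(
    "the bank approved the loan and the bank set the interest",
    category_profiles,
    k=3,
)
print(label)
```

In a sketch like this, the cost per document scales with k rather than with document length once the features are chosen, which is the kind of trade-off between accuracy and processing time that the abstract describes.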

Pages: 194-204

Page count: 11
