Fast categorisation of large document collections

Cited by: 2

Authors:

Shanks, V [1]

Williams, HE [1]

Affiliations:

[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia

Source:

EIGHTH SYMPOSIUM ON STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2001

Keywords:

document management; categorisation; feature extraction; efficiency

DOI:

10.1109/SPIRE.2001.989757

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

As the volume of data stored online increases, careful management of large document collections becomes increasingly important. Categorisation is one important document management technique. It has been effectively employed on the Web, where links to documents are maintained in topic or interest areas in, for example, the manually-categorised Yahoo! hierarchy. The drawbacks of manual categorisation are that it is practical only for small numbers of documents, is not scalable, and relies on the subjective judgement of human assessors. Automatic categorisation has been shown to be an accurate alternative to manual categorisation. In automatic categorisation, documents are processed and automatically assigned to pre-defined categories that represent an interest or topic area. We propose and investigate heuristics for fast categorisation of large collections of documents that are focused on selecting a minimal set of representative features from uncategorised documents. We show that these new heuristics are accurate (in some cases more accurate than the baseline techniques) and also permit more than threefold reductions in processing time for categorising large collections.


Pages: 194-204

Number of pages: 11

