Automated classification of textual documents based on a controlled vocabulary in engineering

Cited by: 10
Authors
Golub, Koraljka [1]
Hamon, Thierry [2]
Ardo, Anders [1]
Affiliations
[1] Lund Univ, KnowLib Res Grp, SE-22100 Lund, Sweden
[2] Univ Paris 13, Lab Informat Paris Nord, CNRS, UMR 7030, Inst Galilee, F-93430 Villetaneuse, France
Source
KNOWLEDGE ORGANIZATION | 2007 / Vol. 34 / Issue 04
DOI
10.5771/0943-7444-2007-4-247
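Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Subject classification code
1205; 120501
Abstract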
Automated subject classification has been a challenging research issue for many years, receiving particular attention in the past decade due to the rapid increase in digital documents. The most frequent approach to automated classification is machine learning. It, however, requires training documents and performs well on new documents only if these are similar enough to the former. We explore a string-matching algorithm based on a controlled vocabulary, which does not require training documents; instead, it reuses the intellectual work put into creating the controlled vocabulary. Terms from the Engineering Information thesaurus and classification scheme were matched against the titles and abstracts of engineering papers from the Compendex database. Simple string-matching was enhanced by several methods such as term weighting schemes and cut-offs, exclusion of certain terms, and enrichment of the controlled vocabulary with automatically extracted terms. The best results are 76% recall when the controlled vocabulary is enriched with new terms, and 79% precision when certain terms are excluded. Precision of individual classes is up to 98%. These results are comparable to state-of-the-art machine-learning algorithms.
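The general idea described in the abstract (matching controlled-vocabulary terms against text, weighting the matches, and applying a cut-off) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the class codes `C1` and `C2`, the weights, the cut-off value, and the function names `classify` and `precision_recall` are all hypothetical, and plain substring counting stands in for whatever matching rules the paper actually uses.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: term -> (class code, weight).
# Real thesaurus terms, class codes, and weights are not given in the abstract.
VOCABULARY = {
    "heat transfer": ("C1", 2.0),
    "thermal conductivity": ("C1", 1.5),
    "concrete": ("C2", 1.0),
    "reinforced concrete": ("C2", 2.0),
}


def classify(text, vocabulary=VOCABULARY, cutoff=2.0):
    """Return the classes whose summed term weights reach the cut-off."""
    # Lowercase and collapse whitespace so multi-word terms match reliably.
    normalized = " ".join(text.lower().split())
    scores = defaultdict(float)
    for term, (class_code, weight) in vocabulary.items():
        # Simplification: substring counting, so "concrete" is also
        # counted inside "reinforced concrete".
        scores[class_code] += weight * normalized.count(term)
    return {code: score for code, score in scores.items() if score >= cutoff}


def precision_recall(assigned, correct):
    """Compute precision and recall of assigned classes against correct ones."""
    assigned, correct = set(assigned), set(correct)
    hits = len(assigned & correct)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


if __name__ == "__main__":
    result = classify("Heat transfer in reinforced concrete walls")
    print(result)
    print(precision_recall(result.keys(), ["C1"]))
```

Raising the cut-off in this sketch drops weakly supported classes, which tends to favor precision over recall; adding more terms to the vocabulary has the opposite tendency, in line with the trade-off the abstract reports.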
Pages: 247-263
Number of pages: 17
Related papers
  • [1] Automated subject classification of textual web documents
    Golub, Koraljka
    [J]. JOURNAL OF DOCUMENTATION, 2006, 62 (03) : 350 - 371
  • [2] Automated Subject Classification of Textual Documents in the Context of Web-Based Hierarchical Browsing
    Golub, Koraljka
    [J]. KNOWLEDGE ORGANIZATION, 2011, 38 (03) : 230 - 244
  • [3] Distributed classification of textual documents on the Grid
    Janciak, Ivan
    Sarnovsky, Martin
    Tjoa, A. Min
    Brezany, Peter
    [J]. HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS, 2006, 4208 : 710 - 718
  • [4] Improving retrieval effectiveness by reranking documents based on controlled vocabulary
    Kamps, J
    [J]. ADVANCES IN INFORMATION RETRIEVAL, PROCEEDINGS, 2004, 2997 : 283 - 295
  • [5] Automated Geocoding of Textual Documents: A Survey of Current Approaches
    Melo, Fernando
    Martins, Bruno
    [J]. TRANSACTIONS IN GIS, 2017, 21 (01) : 3 - 38
  • [6] Content Linguistic Analysis Methods for Textual Documents Classification
    Lytvyn, Vasyl
    Vysotska, Victoria
    Veres, Oleh
    Rishnyak, Ihor
    Rishnyak, Halya
    [J]. 2016 XITH INTERNATIONAL SCIENTIFIC AND TECHNICAL CONFERENCE COMPUTER SCIENCES AND INFORMATION TECHNOLOGIES (CSIT), 2016 : 190 - 192
  • [7] Behaviors of Reservoir Computing Models for Textual Documents Classification
    Schaetti, Nils
    [J]. 2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019
  • [8] Classification of Untranscribed Handwritten Notarial Documents by Textual Contents
    Jose Flores, Juan
    Ramon Prieto, Jose
    Garrido, David
    Alonso, Carlos
    Vidal, Enrique
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2022), 2022, 13256 : 14 - 26
  • [9] A proposal for annotation, semantic similarity and classification of textual documents
    Nauer, Emmanuel
    Napoli, Amedeo
    [J]. ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2006, 4183 : 201 - 212
  • [10] Automatic Classification of Research Documents using Textual Entailment
    Ojokoh, Bolanle Adefowoke
    Omisore, Olatunji Mumini
    Samuel, Oluwarotimi Williams
    [J]. PROCEEDINGS OF THE 15TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL'15), 2015 : 251 - 252