A comparative study of two automatic document classification methods in a library setting

Cited by: 17
Authors
Pong, Joanna Yi-Hang [2]
Kwok, Ron Chi-Wai [1]
Lau, Raymond Yiu-Keung [1]
Hao, Jin-Xing [1]
Wong, Percy Ching-Chi [1]
Affiliations
[1] City Univ Hong Kong, Dept Informat Syst, Kowloon, Hong Kong, Peoples R China
[2] City Univ Hong Kong, Run Run Shaw Library, Kowloon, Hong Kong, Peoples R China
Keywords
automatic document classification; text categorization; machine learning; k-nearest neighbours classifier; naive Bayes classifier; library practice
DOI
10.1177/0165551507082592
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In current library practice, trained human experts usually catalogue and index documents manually. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials by manual means alone. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of automatic document classification methods for categorizing library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine-learning-based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations are made on how to apply the KNN algorithm in practice to develop automatic document classification in a library setting. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used Library of Congress Classification (LCC) scheme adopted by many large libraries.
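The two classifier families named in the keywords can be illustrated with a minimal sketch. This is not the authors' system: the record does not state their features, term weighting, value of k, or training data, so everything below is an assumption made for illustration. The sketch uses raw word counts, cosine similarity for KNN, Laplace smoothing for naive Bayes, and a handful of invented titles labelled with two LCC-style class letters ("QA" and "Z").

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (title, LCC-style class). Invented for illustration.
TRAIN = [
    ("introduction to linear algebra and matrices", "QA"),
    ("calculus and differential equations", "QA"),
    ("probability theory and statistics", "QA"),
    ("library cataloguing and indexing practice", "Z"),
    ("digital libraries and information retrieval", "Z"),
    ("bibliography and library classification", "Z"),
]

def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, text, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    query = Counter(tokenize(text))
    scored = sorted(
        ((cosine(query, Counter(tokenize(doc))), label) for doc, label in train),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

def nb_train(train):
    """Collect the per-class document and term counts needed by multinomial naive Bayes."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in train:
        tokens = tokenize(doc)
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def nb_predict(model, text):
    """Pick the class with the highest log posterior, using Laplace (add-one) smoothing."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

if __name__ == "__main__":
    model = nb_train(TRAIN)
    for title in ("matrices and linear equations", "library classification and indexing"):
        print(title, "| KNN:", knn_predict(TRAIN, title), "| NB:", nb_predict(model, title))
```

KNN needs no training step but compares each new item against every stored example, whereas naive Bayes summarizes the training set into counts once and then scores each class directly. On a collection of realistic size, term weighting and the choice of k would need tuning, which this sketch does not attempt.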
Pages: 213-230
Number of pages: 18