Performance Comparison and Optimization of Text Document Classification using k-NN and Naive Bayes Classification Techniques

Cited by: 18
Authors
Rasjid, Zulfany Erlisa [1]
Setiawan, Reina [1]
Affiliations
[1] Bina Nusantara Univ, Comp Sci Dept, Jl KH Syahdan 9, Jakarta 11480, Indonesia
Keywords
k-NN; Naive Bayes; Text Document Classification; Information Retrieval
DOI
10.1016/j.procs.2017.10.017
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Information is now available in many formats, including text, image, video and audio. A corpus is a large collection of documents. Information Retrieval (IR) makes it possible to obtain unstructured information and to perform automatic summarization, classification and clustering. This research focuses on data classification using two of the six approaches to data classification: k-NN (k-Nearest Neighbors) and Naive Bayes. The text documents used are in XML format. The corpus was downloaded from the TREC Legal Track and contains more than three thousand text documents and over twenty classification types. Of these, the six types with the largest number of text documents were chosen. The data was processed using RapidMiner software, and the results show that the optimum value of k in k-NN is k = 13. With this value of k, the average accuracy reached 55.17 percent, which is better than the 39.01 percent obtained with Naive Bayes. (C) 2017 The Authors. Published by Elsevier B.V.
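The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' RapidMiner process: the toy documents, the labels, the whitespace tokenizer, the cosine similarity measure, the add-one smoothing and all function names are assumptions made for illustration only. It shows the general shape of the experiment: classify held-out documents with k-NN for several values of k, classify them with multinomial Naive Bayes, and compare accuracy.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two term-count Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, text, k):
    # Majority vote among the k most similar training documents.
    query = Counter(tokenize(text))
    scored = sorted(
        ((cosine(query, Counter(tokenize(doc))), label) for doc, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

def nb_train(train):
    # Collect per-class document counts, per-class word counts, and the vocabulary.
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in train:
        tokens = tokenize(doc)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def nb_predict(model, text):
    # Multinomial Naive Bayes with add-one smoothing, scored in log space.
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(predict, test):
    return sum(predict(doc) == label for doc, label in test) / len(test)

# Hypothetical toy data, not taken from the TREC Legal Track corpus.
TRAIN = [
    ("court ruling appeal judge", "legal"),
    ("contract breach damages court", "legal"),
    ("judge verdict trial jury", "legal"),
    ("tobacco smoking health risk", "health"),
    ("nicotine health study smoking", "health"),
    ("cancer risk study patients", "health"),
]
TEST = [
    ("jury trial court", "legal"),
    ("smoking cancer study", "health"),
]

model = nb_train(TRAIN)
nb_acc = accuracy(lambda d: nb_predict(model, d), TEST)
# Sweep a few odd values of k and keep the one with the highest accuracy.
knn_acc = {
    k: accuracy(lambda d, k=k: knn_predict(TRAIN, d, k), TEST) for k in (1, 3, 5)
}
best_k = max(knn_acc, key=knn_acc.get)
print("Naive Bayes accuracy:", nb_acc)
print("k-NN accuracy by k:", knn_acc, "best k:", best_k)
```

On a real corpus the sweep over k would cover a wider range and the accuracy would be averaged over several train/test splits; the toy data here is far too small to say anything about which classifier performs better.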
Pages: 107-112
Page count: 6