Performance Comparison and Optimization of Text Document Classification using k-NN and Naive Bayes Classification Techniques

Cited by: 18
Authors
Rasjid, Zulfany Erlisa [1]
Setiawan, Reina [1]
Affiliations
[1] Bina Nusantara Univ, Comp Sci Dept, Jl KH Syahdan 9, Jakarta 11480, Indonesia
Keywords
k-NN; Naive Bayes; Text Document Classification; Information Retrieval
DOI
10.1016/j.procs.2017.10.017
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the current era, information is available in several different formats, such as text, image, video and audio. A corpus is a large collection of documents. Information Retrieval (IR) makes it possible to obtain unstructured information and to perform automatic summarization, classification and clustering. This research focuses on data classification using two of the six approaches to data classification, namely k-NN (k-Nearest Neighbors) and Naive Bayes. The text documents used are in XML format. The corpus used in this research was downloaded from the TREC Legal Track and contains more than three thousand text documents and over twenty types of classifications. Of these, the six classifications with the largest number of text documents are chosen. The data is processed using RapidMiner software, and the results show that the optimum value for k in k-NN occurs at k=13. Using this value for k, the average accuracy reached 55.17 percent, which is better than the 39.01 percent obtained with Naive Bayes. (C) 2017 The Authors. Published by Elsevier B.V.
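The comparison described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' RapidMiner workflow: the whitespace tokenizer, the cosine similarity over raw term counts, the Laplace-smoothed multinomial Naive Bayes, and the tiny `train` and `held_out` document sets are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, text, k):
    """Majority vote among the k training documents most similar to `text`."""
    query = Counter(tokenize(text))
    ranked = sorted(train,
                    key=lambda item: cosine(query, Counter(tokenize(item[0]))),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def nb_train(train):
    """Collect per-class document counts, per-class term counts and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in train:
        class_docs[label] += 1
        for tok in tokenize(text):
            term_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, term_counts, vocab

def nb_predict(model, text):
    """Multinomial Naive Bayes with add-one smoothing, scored in log space."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(predict, labelled_docs):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(predict(text) == label for text, label in labelled_docs)
    return hits / len(labelled_docs)

def best_k(train, validation, candidates):
    """Return the candidate k with the highest validation accuracy."""
    return max(candidates,
               key=lambda k: accuracy(lambda t: knn_predict(train, t, k), validation))

# Hypothetical toy data; the paper's corpus is far larger and has six classes.
train = [
    ("contract breach damages court", "legal"),
    ("court ruling appeal judge", "legal"),
    ("plaintiff filed lawsuit court", "legal"),
    ("tobacco marketing campaign budget", "marketing"),
    ("advertising campaign brand sales", "marketing"),
    ("sales budget brand promotion", "marketing"),
]
held_out = [
    ("judge court appeal", "legal"),
    ("brand advertising budget", "marketing"),
]

if __name__ == "__main__":
    k = best_k(train, held_out, [1, 3, 5])
    model = nb_train(train)
    print("chosen k:", k)
    print("k-NN accuracy:", accuracy(lambda t: knn_predict(train, t, k), held_out))
    print("Naive Bayes accuracy:", accuracy(lambda t: nb_predict(model, t), held_out))
```

On a real corpus, the value of k would be selected on a validation split that is kept separate from the final test set, and the accuracy figures would of course differ from anything this toy data produces.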
Pages: 107-112
Page count: 6