Text Categorization for Vietnamese Documents

Cited by: 0
Authors
Nguyen, Giang-Son [1]
Gao, Xiaoying [1]
Andreae, Peter [1]
Affiliations
[1] Victoria Univ Wellington, Sch Engn & Comp Sci, Wellington, New Zealand
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Many machine learning methods have been proposed for text categorization, but most research has applied them to English documents. Vietnamese differs from English in its linguistic features, and it is not clear whether the standard methods will work for categorizing Vietnamese documents. This paper describes morphological-level document representations that are appropriate for Vietnamese text documents and investigates the effectiveness of several standard learning algorithms, including Naive Bayes, k-Nearest Neighbour (kNN), and Support Vector Machine (SVM) with four different kernel functions. The results show that effective and efficient classifiers for Vietnamese text categorization can be built using our representations and the standard algorithms, and that performance can be improved by using information gain (infogain) for feature selection and an external dictionary for filtering the vocabulary.
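The two vocabulary-reduction steps named in the abstract, information-gain feature selection and dictionary filtering, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not detail their representation, so the sketch assumes each document is simply a set of whitespace-separated tokens, and the toy documents, labels, and the names `information_gain` and `select_features` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, k, dictionary=None):
    """Keep the k terms with the highest information gain.

    If `dictionary` is given, terms outside it are discarded first
    (an assumed stand-in for the external-dictionary filter).
    """
    vocab = set().union(*docs)
    if dictionary is not None:
        vocab &= dictionary
    # Sort by descending gain; break ties alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"bóng", "đá", "trận"},
    {"bóng", "đá", "đội"},
    {"giá", "vàng", "tăng"},
    {"giá", "dầu", "đội"},
]
labels = ["sport", "sport", "economy", "economy"]

top_terms = select_features(docs, labels, k=2)
filtered_terms = select_features(docs, labels, k=2, dictionary={"bóng", "đội"})
```

In this toy corpus, a term that appears in only one category separates the classes perfectly and gets the maximum gain, while a term that is spread evenly across both categories gets a gain of zero. The selected terms would then serve as the feature set for any of the classifiers the abstract lists.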
Pages: 466-469
Number of pages: 4