Text Categorization for Vietnamese Documents

Authors
Nguyen, Giang-Son [1]
Gao, Xiaoying [1]
Andreae, Peter [1]
Affiliation
[1] Victoria Univ Wellington, Sch Engn & Comp Sci, Wellington, New Zealand
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many machine learning methods have been proposed for text categorization, but most research has applied them to English documents. Vietnamese differs from English in its linguistic features, and it is not clear whether the standard methods will work for categorizing Vietnamese documents. This paper describes morphological-level document representations that are appropriate for Vietnamese text documents and investigates the effectiveness of several standard learning algorithms, including Naive Bayes, K-Nearest Neighbour (KNN), and Support Vector Machine (SVM) with four different kernel functions. The results show that it is possible to build effective and efficient classifiers for Vietnamese text categorization using our representations and the standard algorithms, and demonstrate that performance can be improved by using information gain (infogain) for feature selection and an external dictionary for filtering the vocabulary.
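The feature selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each document is already reduced to a set of tokens, scores each term by the information gain of its presence or absence with respect to the class labels, and optionally restricts the vocabulary to an external dictionary first. The function names (`entropy`, `information_gain`, `select_terms`) and the toy tokens and labels are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_terms(docs, labels, k, dictionary=None):
    """Return the k terms with the highest information gain.

    If a dictionary (a set of allowed terms) is given, the vocabulary is
    filtered to terms found in it before ranking. Ties are broken by
    sorting the terms as strings (an arbitrary choice for this sketch).
    """
    vocab = set().union(*docs)
    if dictionary is not None:
        vocab &= dictionary
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Toy data (hypothetical): each document is a set of tokens.
docs = [{"bóng", "đá"}, {"bóng", "rổ"}, {"cổ", "phiếu"}, {"ngân", "hàng"}]
labels = ["sport", "sport", "finance", "finance"]

top_term = select_terms(docs, labels, k=1)
filtered = select_terms(docs, labels, k=5, dictionary={"đá", "phiếu"})
```

In this toy data the token "bóng" separates the two classes perfectly, so it receives the maximum possible gain of 1 bit and ranks first. A real pipeline would pass the selected terms on to a classifier such as Naive Bayes, KNN, or an SVM.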
Pages: 466-469
Number of pages: 4