Performance Evaluation of Text Categorization Algorithms Using an Albanian Corpus

Cited by: 0
Authors
Trandafili, Evis [1]
Kote, Nelda [2]
Biba, Marenglen [3]
Affiliations
[1] Polytech Univ Tirana, Fac Informat Technol, Dept Comp Engn, Tirana, Albania
[2] Polytech Univ Tirana, Fac Informat Technol, Dept Fundamentals Comp Sci, Tirana, Albania
[3] New York Univ Tirana, Fac Informat Technol, Dept Comp Sci, Tirana, Albania
DOI
10.1007/978-3-319-75928-9_48
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text mining and natural language processing are gaining a significant role in daily life as information volumes steadily increase. Most digital information is unstructured, in the form of raw text. While there is extensive research on text mining and language processing for several languages, much less work has been done for others. In this paper we evaluate the performance of some of the most important text classification algorithms on a corpus of Albanian texts. After natural language preprocessing, we apply several algorithms: Simple Logistics, Naive Bayes, k-Nearest Neighbor, Decision Trees, Random Forest, Support Vector Machines and Neural Networks. The experiments show that Naive Bayes and Support Vector Machines perform best in classifying Albanian corpora. The Simple Logistics algorithm also shows good results.
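As an illustration only, the following is a minimal sketch of one of the algorithm families named in the abstract: a multinomial Naive Bayes classifier over bag-of-words counts with add-one smoothing. It is not the authors' implementation or toolkit. The function names (`tokenize`, `train_naive_bayes`, `predict`), the two category labels, and the four toy Albanian sentences are all hypothetical and are not taken from the paper's corpus; the whitespace tokenizer is a stand-in for the unspecified preprocessing steps.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical stand-in for preprocessing: lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(docs):
    """Count documents per class and word occurrences per class.

    docs is a list of (text, label) pairs.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Smoothed log likelihood of the word given the class.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data (assumption: NOT from the paper's corpus).
TOY_DOCS = [
    ("skuadra fitoi ndeshje futboll", "sport"),
    ("gol ne ndeshje", "sport"),
    ("parlamenti miratoi ligj", "politike"),
    ("qeveria dhe zgjedhje", "politike"),
]

model = train_naive_bayes(TOY_DOCS)
```

Calling `predict(model, "ndeshje futboll")` on this toy data selects the `sport` label; a real evaluation would need a proper corpus, held-out test data, and the preprocessing the paper describes.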
Pages: 537-547
Page count: 11