N-grams and morphological normalization in text classification: A comparison on a Croatian-English parallel corpus

Cited by: 0
Authors
Silic, Artur [1]
Chauchat, Jean-Hugues [2]
Basic, Bojana Dalbelo [1]
Morin, Annie [3]
Affiliations
[1] Univ Zagreb, Dept Elect Microelect Comp & Intelligent Syst, KTLab, Unska 3, Zagreb 1000, Croatia
[2] Univ Lyon 2, Fac Sci Econ Gestion, Lab Eric, F-69676 Bron, France
[3] Univ Rennes 1, IRISA, F-35042 Rennes, France
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
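The trade-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' experimental setup: the function names `char_ngrams` and `word_features`, the choice of n = 3, and the two-word sample are all assumptions made for illustration. It shows how character n-grams let two inflected forms of a word share features without a lemmatizer, while producing more distinct features than word tokens do.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_features(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens (no morphological normalization)."""
    return Counter(text.lower().split())


# Hypothetical sample: two inflected forms of the Croatian word "knjiga" (book).
sample = "knjiga knjige"
ngram_counts = char_ngrams(sample, n=3)
word_counts = word_features(sample)

# Trigrams that occur more than once are shared by both inflected forms.
shared = sorted(g for g, c in ngram_counts.items() if c > 1)

print("distinct trigrams:", len(ngram_counts))
print("distinct words:", len(word_counts))
print("shared trigrams:", shared)
```

Without normalization, the word-based representation treats the two forms as unrelated features, whereas the trigram representation links them through their common stem. The price is a larger feature set even on this tiny input, which is consistent with the higher execution time and memory consumption the abstract reports for n-grams.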
Pages: 671 / +
Page count: 4
Related papers
18 in total
  • [1] Language morphology offset: Text classification on a Croatian-English parallel corpus
    Malenica, M.
    Smuc, T.
    Snajder, J.
    Basic, B. Dalbelo
    INFORMATION PROCESSING & MANAGEMENT, 2008, 44 (01) : 325 - 339
  • [2] Sentence Classification Using N-Grams in Urdu Language Text
    Awan, Malik Daler Ali
    Ali, Sikandar
    Samad, Ali
    Iqbal, Nadeem
    Missen, Malik Muhammad Saad
    Ullah, Niamat
    SCIENTIFIC PROGRAMMING, 2021, 2021
  • [3] Using Word N-Grams as Features in Arabic Text Classification
    Al-Thubaity, Abdulmohsen
    Alhoshan, Muneera
    Hazzaa, Itisam
    SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING, 2015, 569 : 35 - 43
  • [4] N-grams based feature selection and text representation for Chinese text classification
    Department of Computer Science and Engineering, Tongji University, Cao'an Road, 4800, Shanghai, 201804, China
    Unknown
    Unknown
    Int. J. Comput. Intell. Syst., 2009, 2 (4) : 365 - 374
  • [5] N-grams based feature selection and text representation for Chinese Text Classification
    Zhihua Wei
    Duoqian Miao
    Jean Hugues Chauchat
    Rui Zhao
    Wen Li
    International Journal of Computational Intelligence Systems, 2009, 2 (4) : 365 - 374
  • [6] N-grams based feature selection and text representation for Chinese Text Classification
    Wei, Zhihua
    Miao, Duoqian
    Chauchat, Jean-Hugues
    Zhao, Rui
    Li, Wen
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS, 2009, 2 (04) : 365 - 374
  • [7] Text classification and multilinguism: Getting at words via N-grams of characters
    Biskri, I
    Delisle, S
    6TH WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL V, PROCEEDINGS: COMPUTER SCI I, 2002, : 110 - 115
  • [8] Crowd Sourcing as an Improvement of N-Grams Text Document Classification Algorithm
    Saloun, Petr
    Andrsic, David
    Cigankova, Barbora
    Anagnostopoulos, Ioannis
    2020 15TH INTERNATIONAL WORKSHOP ON SEMANTIC AND SOCIAL MEDIA ADAPTATION AND PERSONALIZATION (SMAP 2020), 2020, : 162 - 167
  • [9] Feature selection on Chinese text classification using character n-grams
    Wei, Zhihua
    Miao, Duoqian
    Chauchat, Jean-Hugues
    Zhong, Caiming
    ROUGH SETS AND KNOWLEDGE TECHNOLOGY, 2008, 5009 : 500 - +
  • [10] Using of n-grams from morphological tags for fake news classification
    Kapusta, Jozef
    Drlik, Martin
    Munk, Michal
    PEERJ COMPUTER SCIENCE, 2021, 7