Stemming Malay Text and Its Application in Automatic Text Categorization

Cited by: 5
Authors
Yasukawa, Michiko [1]
Lim, Hui Tian [2]
Yokoo, Hidetoshi [1]
Affiliations
[1] Gunma Univ, Grad Sch Engn, Kiryu, Gunma 3768515, Japan
[2] Gunma Univ, Dept Comp Sci, Kiryu, Gunma 3768515, Japan
Keywords
Malay language; stemmer; stemming; affix rule; text mining
DOI
10.1587/transinf.E92.D.2351
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The Malay language has no conjugations or declensions, and affixes carry important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversation, it is essential to use precise words in formal speech and written texts. In Malay, derivative words are used to make sentences clear. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of the educated Malay. Therefore, the composition of Malay words may be complicated. Although several types of stemming algorithms are available for text processing in English and some other languages, they cannot be used to overcome the difficulties of Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The set of rules is aimed at reducing the occurrence of under-stemming errors, while the dictionaries are believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. To demonstrate the effectiveness of our Malay stemming algorithm, the text data used in the experiment were actual web pages collected from the World Wide Web. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
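The interplay of affix rules and dictionaries described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the affix lists, the toy dictionary entries, the function name `stem`, and the choice to treat the derivative-word dictionary as a direct lookup from derived form to root are all assumptions made for illustration. The sketch only shows the general idea that rules propose candidate roots (limiting under-stemming) while dictionary checks reject bad candidates (limiting over-stemming).

```python
# Illustrative sketch only. All data below are toy assumptions,
# not the rules or dictionaries used in the paper.

# Hypothetical root-word dictionary.
ROOT_WORDS = {"ajar", "baca", "berat", "main", "makan"}

# Hypothetical derivative-word dictionary: known derived forms
# mapped directly to their roots (an assumed design).
DERIVATIVE_WORDS = {"belajar": "ajar", "pelajaran": "ajar"}

# Hypothetical affix rules, longest first.
PREFIXES = ["mem", "ber", "di"]
SUFFIXES = ["kan", "an", "i"]


def stem(word: str) -> str:
    """Return an assumed root form of `word`, or `word` itself if none is found."""
    word = word.lower()

    # A word that is already a root is never stripped. This guards
    # against over-stemming, e.g. "makan" -> "mak" or "berat" -> "at".
    if word in ROOT_WORDS:
        return word

    # Known derived forms are resolved by lookup.
    if word in DERIVATIVE_WORDS:
        return DERIVATIVE_WORDS[word]

    # Try every prefix/suffix combination (the empty string means
    # "no affix") and accept a candidate only if it is a known root.
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in [""] + SUFFIXES:
            if not word.endswith(suffix):
                continue
            if len(prefix) + len(suffix) >= len(word):
                continue
            candidate = word[len(prefix):len(word) - len(suffix)]
            if candidate in ROOT_WORDS:
                return candidate

    # No rule produced a known root: leave the word unchanged.
    return word
```

Under these toy assumptions, `stem("makanan")` yields `makan` through the suffix rule, while `stem("makan")` and `stem("berat")` are returned unchanged because the root-word check runs before any affix is stripped.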
Pages: 2351-2359
Number of pages: 9