Stemming Malay Text and Its Application in Automatic Text Categorization

Cited by: 5
Authors
Yasukawa, Michiko [1]
Lim, Hui Tian [2]
Yokoo, Hidetoshi [1]
Affiliations
[1] Gunma Univ, Grad Sch Engn, Kiryu, Gunma 3768515, Japan
[2] Gunma Univ, Dept Comp Sci, Kiryu, Gunma 3768515, Japan
Keywords
Malay language; stemmer; stemming; affix rule; text mining
DOI
10.1587/transinf.E92.D.2351
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In the Malay language, there are no conjugations or declensions, and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversations, it is essential to use precise words in formal speech or written texts. In Malay, derivative words are used to make sentences clear. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of the educated Malay. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The set of rules is aimed at reducing the occurrence of under-stemming errors, while the dictionaries are believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. To demonstrate the effectiveness of our Malay stemming algorithm, the text data used in the experiment were actual web pages collected from the World Wide Web. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
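The interplay between affix rules and the two dictionaries can be illustrated with a minimal sketch. This is not the authors' algorithm: the lookup order, the tiny word lists, the affix lists, and the function name `stem` are all assumptions made for illustration. The root-word dictionary stops a root such as "makan" from being over-stemmed, the derivative-word dictionary handles a form the simple rules miss, and the affix rules cover the remaining derived forms.

```python
# Hypothetical toy dictionaries; a real stemmer would use full lexicons.
ROOT_WORDS = {"makan", "baca", "main", "ajar"}
DERIVATIVE_WORDS = {"pelajar": "ajar"}  # derived form -> root (assumed layout)

# Assumed, deliberately incomplete affix lists; "" means "no affix".
PREFIXES = ("", "mem", "me", "ber", "di", "ter")
SUFFIXES = ("", "kan", "an", "i")


def stem(word: str) -> str:
    """Return the root of `word`, or `word` itself if no root is found."""
    word = word.lower()
    # Root-word dictionary: a known root is never stripped further,
    # which guards against over-stemming ("makan" must not become "mak").
    if word in ROOT_WORDS:
        return word
    # Derivative-word dictionary: forms the affix rules cannot resolve.
    if word in DERIVATIVE_WORDS:
        return DERIVATIVE_WORDS[word]
    # Affix rules: try every prefix/suffix combination and accept a
    # candidate only if it is a known root.
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in SUFFIXES:
            if suffix and not word.endswith(suffix):
                continue
            candidate = word[len(prefix):len(word) - len(suffix)]
            if candidate in ROOT_WORDS:
                return candidate
    # Unknown word: leave it unchanged rather than guess.
    return word


if __name__ == "__main__":
    for w in ["makan", "makanan", "membaca", "dimainkan", "pelajar"]:
        print(w, "->", stem(w))
```

In this sketch a stripped candidate is accepted only when it appears in the root-word dictionary, so unknown words pass through unchanged. That is one possible design choice, not a detail stated in the abstract.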
Pages: 2351-2359
Number of pages: 9