Stemming Malay Text and Its Application in Automatic Text Categorization

Cited by: 5

Authors:

Yasukawa, Michiko [1]

Lim, Hui Tian [2]

Yokoo, Hidetoshi [1]

Affiliations:

[1] Gunma Univ, Grad Sch Engn, Kiryu, Gunma 3768515, Japan

[2] Gunma Univ, Dept Comp Sci, Kiryu, Gunma 3768515, Japan

Source:

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2009 / Vol. E92D / No. 12

Keywords:

Malay language; stemmer; stemming; affix rule; text mining

DOI:

10.1587/transinf.E92.D.2351

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

The Malay language has no conjugations or declensions, and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversations, it is essential to use precise words in formal speech or written texts. In Malay, derivative words are used to make sentences clear. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of the educated Malay. Therefore, the composition of Malay words may be complicated. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The set of rules is aimed at reducing the occurrence of under-stemming errors, while the dictionaries are believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. The text data used in the experiment were actual web pages collected from the World Wide Web, chosen to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
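The combination of affix rules with a root-word dictionary and a derivative-word dictionary can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not list the actual rules or dictionary contents, so the word lists, the affix lists, and the function name `stem` below are all hypothetical and chosen only to show how the three components might interact.

```python
# A toy rule-plus-dictionary stemmer. All data below is hypothetical
# illustration, not the rule set or dictionaries used in the paper.

# Root-word dictionary: a candidate stem is accepted only if it appears here,
# which guards against over-stemming (e.g. stripping "-an" from "makan").
ROOT_WORDS = {"ajar", "makan", "baca", "main"}

# Derivative-word dictionary: known derived forms mapped directly to roots,
# for forms that simple affix stripping would not resolve.
DERIVATIVE_WORDS = {"pelajar": "ajar"}

# Affix rules: a tiny assumed subset of Malay prefixes and suffixes.
PREFIXES = ["mem", "ber", "pe", "di", "me"]
SUFFIXES = ["kan", "an", "i"]


def stem(word: str) -> str:
    """Return the root of `word`, or `word` unchanged if no rule applies."""
    word = word.lower()
    if word in ROOT_WORDS:            # already a root: never strip further
        return word
    if word in DERIVATIVE_WORDS:      # known derivative: direct lookup
        return DERIVATIVE_WORDS[word]
    # Try every prefix/suffix combination (including "no prefix" or
    # "no suffix") and accept the first candidate found in the root list.
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in [""] + SUFFIXES:
            if not (prefix or suffix) or not word.endswith(suffix):
                continue
            candidate = word[len(prefix):len(word) - len(suffix)]
            if candidate in ROOT_WORDS:
                return candidate
    return word                       # unknown word: leave it untouched


for w in ["makanan", "dibaca", "bermain", "pelajar", "makan", "berita"]:
    print(w, "->", stem(w))
```

In this sketch the affix rules do the work of collapsing derived forms onto a shared root, while the dictionary checks stop a rule from firing on a word that only looks affixed, which mirrors the division of labour the abstract describes in general terms.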

Citation

Pages: 2351-2359

Number of pages: 9
