Rule-based Text Normalization for Malay Social Media Texts

Authors
Ariffin, Siti Noor Allia Noor [1]
Tiun, Sabrina [1]
Affiliations
[1] Univ Kebangsaan Malaysia, Fac Informat Sci & Technol, Bangi, Selangor, Malaysia
Keywords
Malay normalization; Malay text normalization; informal Malay text; Malay tweets; rule-based normalizer
DOI
10.14569/IJACSA.2020.0111021
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Malay social media text is text written on social media networks such as Twitter. It commonly contains non-standard words, including dialect, foreign-language words, abbreviations, ungrammatical constructions, and spelling errors. This type of text is known to be difficult to process because of its high noise and distinct structure. Rigorous text normalization can address these problems and is critical before any technique is implemented and evaluated on social media text. This paper proposes an improved normalization method for Malay social media text that converts non-standard Malay words using a rule-based model. The method normalizes words commonly used by Malaysian users, such as non-standard Malay (for example, dialect and slang), Romanized Arabic, and English words. The proposed Malay text normalizer uses a set of rules that extend across different domains of natural language processing (NLP) and is expected to address the challenges of processing Malay social media text. To evaluate the normalizer's performance, the study applies it in a Part-of-Speech (POS) tagging application. The implementation demonstrates a substantial improvement in POS tagging performance over several pre-processing stages, with an improvement in accuracy of up to 31.8%. The increase in POS tagging accuracy indicates two main points. First, the normalizer's rules improve the performance of a Malay text normalizer on social media text. Second, the proposed normalizer successfully improves POS tagging accuracy and demonstrates the importance of normalization as a preprocessing step in any NLP application.
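The abstract does not list the rules themselves, so the following is only a minimal sketch of what a rule-based normalizer of this general kind might look like, not the authors' method. It assumes two hypothetical rule types: collapsing characters repeated three or more times, and a lookup table of informal spellings. The `LEXICON` entries and the `normalize` function are illustrative assumptions, not taken from the paper.

```python
import re

# Hypothetical lookup table of informal spellings and their standard forms.
# These entries are illustrative only; they are not the paper's rule set.
LEXICON = {
    "x": "tidak",
    "yg": "yang",
    "sgt": "sangat",
    "dgn": "dengan",
    "thx": "thanks",
}

# Matches any character repeated three or more times in a row.
REPEATS = re.compile(r"(.)\1{2,}")


def normalize(text):
    """Lowercase, collapse elongated characters, then apply the lookup table."""
    out = []
    for tok in text.lower().split():
        tok = REPEATS.sub(r"\1", tok)      # "ituuuu" -> "itu"
        out.append(LEXICON.get(tok, tok))  # unknown tokens pass through unchanged
    return " ".join(out)


if __name__ == "__main__":
    print(normalize("Saya x suka yg ituuuu"))
```

In this sketch the elongation rule runs before the lookup so that stretched spellings can still match a lexicon entry. A real system would need a far larger rule set and would have to handle ambiguous tokens, which this example ignores.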
Pages: 156-162
Number of pages: 7