An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis

Cited by: 21
Authors
Mehmood, Khawar [1]
Essam, Daryl [1]
Shafi, Kamran [1]
Malik, Muhammad Kamran [2]
Affiliations
[1] Univ New South Wales Canberra, Sch Engn & Informat Technol SEIT, Canberra, ACT, Australia
[2] Univ Punjab, Coll Informat Technol PUCIT, Punjab Univ, Lahore, Pakistan
Keywords
Machine learning; Natural language processing; Pattern recognition; Sentiment analysis; Text normalization
DOI
10.1016/j.ipm.2020.102368
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text normalization is the task of transforming lexically variant words to their canonical forms. Its importance becomes apparent when developing natural language processing applications. This paper proposes a novel technique called Transliteration based Encoding for Roman Hindi/Urdu text Normalization (TERUN). TERUN uses the linguistic aspects of Roman Hindi/Urdu to transform lexically variant words to their canonical forms. It consists of three interlinked modules: a transliteration-based encoder, a filter module, and a hash-code ranker. The encoder generates all possible hash codes for a single Roman Hindi/Urdu word. The filter module discards the irrelevant codes, and the ranker orders the remaining hash codes by relevance. The aim of this study is not only to normalize the text but also to examine the impact of normalization on text classification. Hence, baseline classification accuracies were computed on a dataset of 11,000 non-standardized Roman Hindi/Urdu sentiment analysis reviews using different machine learning algorithms. The dataset was then standardized using TERUN and other established phonetic algorithms, and the classification accuracies were recomputed. The cross-scheme comparison showed that TERUN outperformed all the phonetic algorithms and significantly reduced the error rate relative to the baseline. TERUN was then extended from a corpus-specific to a corpus-independent text normalization technique. To this end, a parallel corpus of 50,000 Urdu and Roman Hindi/Urdu words was manually tagged using a set of comprehensive annotation guidelines. In addition, the phonetic algorithms and TERUN were intrinsically evaluated using a dataset of 20,000 lexically variant words. The results showed that TERUN outperformed the well-known phonetic algorithms.
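The encode, filter, and rank flow described in the abstract can be illustrated with a toy sketch. This is not the TERUN implementation: the symbol table `CANDIDATES`, the example lexicon, the functions `encode`, `build_index`, and `normalize`, and the ranking rule (shared-code count, then word frequency) are all hypothetical choices made for illustration. The abstract does not specify TERUN's transliteration rules or its relevance measure.

```python
from collections import Counter
from itertools import product

# Hypothetical table: Roman letter groups -> candidate phonetic symbols.
# An empty string means "this short vowel may be dropped".
CANDIDATES = {
    "kh": ["X"], "aa": ["A"], "ee": ["I"], "oo": ["U"],
    "a": ["A", ""], "e": ["I", ""], "i": ["I", ""],
    "o": ["U", ""], "u": ["U", ""],
}

def encode(word):
    """Module 1 (toy encoder): return every candidate code for a word."""
    word = word.lower()
    options, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in CANDIDATES:
            options.append(CANDIDATES[pair])
            i += 2
        else:
            ch = word[i]
            # Unlisted letters map to themselves (upper-cased).
            options.append(CANDIDATES.get(ch, [ch.upper()]))
            i += 1
    return {"".join(parts) for parts in product(*options)}

def build_index(lexicon):
    """Map each code of each canonical word back to that word."""
    index = {}
    for word in lexicon:
        for code in encode(word):
            index.setdefault(code, set()).add(word)
    return index

def normalize(word, lexicon, index):
    """Return the best canonical form, or the word itself if none matches."""
    votes = Counter()
    for code in encode(word):                # module 1: generate codes
        for canon in index.get(code, ()):    # module 2: drop unknown codes
            votes[canon] += 1
    if not votes:
        return word
    # Module 3 (toy ranker): most shared codes, then highest frequency.
    return max(votes, key=lambda w: (votes[w], lexicon[w]))

# Hypothetical canonical spellings with made-up corpus frequencies.
LEXICON = {"khubsurat": 120, "kitab": 80}
INDEX = build_index(LEXICON)

for variant in ["khoobsurat", "khubsoorat", "kitaab", "ketab"]:
    print(variant, "->", normalize(variant, LEXICON, INDEX))
```

In this sketch, spelling variants collapse onto a canonical word whenever their candidate code sets overlap, and a word with no known code is returned unchanged.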
Pages: 26
Related papers
  • [1] A clustering framework for lexical normalization of Roman Urdu
    Khan, Abdul Rafae
    Karim, Asim
    Sajjad, Hassan
    Kamiran, Faisal
    Xu, Jia
    [J]. NATURAL LANGUAGE ENGINEERING, 2022, 28 (01) : 93 - 123
  • [2] An Unsupervised Approach for Sentiment Analysis on Social Media Short Text Classification in Roman Urdu
    Rana, Toqir A.
    Shahzadi, Kiran
    Rana, Tauseef
    Arshad, Ahsan
    Tubishat, Mohammad
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (02)
  • [3] Sentiment Analysis for Roman Urdu
    Rafique, Ayesha
    Malik, Muhammad Kamran
    Nawaz, Zubair
    Bukhari, Faisal
    Jalbani, Akhtar Hussain
    [J]. MEHRAN UNIVERSITY RESEARCH JOURNAL OF ENGINEERING AND TECHNOLOGY, 2019, 38 (02) : 463 - 470
  • [4] Lexical Variation and Sentiment Analysis of Roman Urdu Sentences with Deep Neural Networks
    Manzoor, Muhammad Arslan
    Mamoon, Saqib
    Tao, Song Kei
    Zakir, Ali
    Adil, Muhammad
    Lu, Jianfeng
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (02) : 719 - 726
  • [5] Lexical variation and sentiment analysis of Roman Urdu sentences with deep neural networks
    Manzoor M.A.
    Mamoon S.
    Tao S.K.
    Zakir A.
    Adil M.
    Lu J.
    [J]. Lu, Jianfeng, 1600, Science and Information Organization : 719 - 726
  • [6] A Roman Urdu Corpus for sentiment analysis
    Khan, Marwa
    Naseer, Asma
    Wali, Aamir
    Tamoor, Maria
    [J]. Computer Journal, 2024, 67 (09) : 2864 - 2876
  • [7] A Roman Urdu Corpus for sentiment analysis
    Khan, Marwa
    Naseer, Asma
    Wali, Aamir
    Tamoor, Maria
    [J]. COMPUTER JOURNAL, 2024
  • [8] Sentiment Analysis System for Roman Urdu
    Mehmood, Khawar
    Essam, Daryl
    Shafi, Kamran
    [J]. INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 29 - 42
  • [9] RUSAS: Roman Urdu Sentiment Analysis System
    Jawad, Kazim
    Ahmad, Muhammad
    Alvi, Majdah
    Alvi, Muhammad Bux
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 79 (01) : 1463 - 1480
  • [10] A Review of Urdu Sentiment Analysis with Multilingual Perspective: A Case of Urdu and Roman Urdu Language
    Khan, Ihsan Ullah
    Khan, Aurangzeb
    Khan, Wahab
    Su'ud, Mazliham Mohd
    Alam, Muhammad Mansoor
    Subhan, Fazli
    Asghar, Muhammad Zubair
    [J]. COMPUTERS, 2022, 11 (01)