An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis

Cited by: 21
Authors
Mehmood, Khawar [1]
Essam, Daryl [1]
Shafi, Kamran [1]
Malik, Muhammad Kamran [2]
Affiliations
[1] Univ New South Wales Canberra, Sch Engn & Informat Technol SEIT, Canberra, ACT, Australia
[2] Univ Punjab, Coll Informat Technol PUCIT, Punjab Univ, Lahore, Pakistan
Keywords
Machine learning; Natural language processing; Pattern recognition; Sentiment analysis; Text normalization
DOI
10.1016/j.ipm.2020.102368
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text normalization is the task of transforming lexically variant words to their canonical forms. Its importance becomes apparent when developing natural language processing applications. This paper proposes a novel technique called Transliteration based Encoding for Roman Hindi/Urdu text Normalization (TERUN). TERUN uses the linguistic aspects of Roman Hindi/Urdu to transform lexically variant words to their canonical forms. It consists of three interlinked modules: a transliteration-based encoder, a filter module, and a hash-code ranker. The encoder generates all possible hash codes for a single Roman Hindi/Urdu word. The filter module discards the irrelevant codes, and the ranker orders the remaining hash codes by relevance. The aim of this study is not only to normalize the text but also to examine the impact of normalization on text classification. Hence, baseline classification accuracies were computed on a dataset of 11,000 non-standardized Roman Hindi/Urdu sentiment analysis reviews using different machine learning algorithms. The dataset was then standardized using TERUN and other established phonetic algorithms, and the classification accuracies were recomputed. The cross-scheme comparison showed that TERUN outperformed all the phonetic algorithms and significantly reduced the error rate relative to the baseline. TERUN was then extended from a corpus-specific to a corpus-independent text normalization technique. To this end, a parallel corpus of 50,000 Urdu and Roman Hindi/Urdu words was manually tagged using a set of comprehensive annotation guidelines. In addition, different phonetic algorithms and TERUN were intrinsically evaluated using a dataset of 20,000 lexically variant words. The results clearly showed the superiority of TERUN over well-known phonetic algorithms.
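The three-stage flow described in the abstract (encode, filter, rank) can be illustrated with a toy sketch. This is not the TERUN implementation: the letter table `LETTER_CODES`, the two-entry `LEXICON` with its frequencies, and all function names are hypothetical and chosen only to show how one word can yield several candidate hash codes that are then filtered and ranked.

```python
from itertools import product

# Hypothetical letter-to-code table (NOT the TERUN tables). A vowel may be
# dropped ("") or kept, and "k"/"q" share one code. Unlisted letters are
# simply upper-cased.
LETTER_CODES = {
    "a": ["", "A"], "e": ["", "I"], "i": ["", "I"],
    "o": ["", "U"], "u": ["", "U"],
    "k": ["K"], "q": ["K"],
}

# Hypothetical lexicon: hash code -> (canonical word, made-up frequency).
LEXICON = {
    "KTAB": ("kitab", 120),
    "KTB": ("kutub", 15),
}


def encode(word):
    """Encoder: generate every candidate hash code for one word."""
    options = [LETTER_CODES.get(ch, [ch.upper()]) for ch in word.lower()]
    return {"".join(choice) for choice in product(*options)}


def filter_codes(codes):
    """Filter: discard codes that match no lexicon entry."""
    return [code for code in codes if code in LEXICON]


def rank_codes(codes):
    """Ranker: order surviving codes by descending lexicon frequency."""
    return sorted(codes, key=lambda code: (-LEXICON[code][1], code))


def normalize(word):
    """Return the top-ranked canonical form, or the word itself if none."""
    ranked = rank_codes(filter_codes(encode(word)))
    return LEXICON[ranked[0]][0] if ranked else word


if __name__ == "__main__":
    for variant in ["kitab", "kitaab", "ketab", "qitab", "kutub"]:
        print(variant, "->", normalize(variant))
```

In this sketch the spelling variants "kitaab", "ketab", and "qitab" each produce several candidate codes, two of which survive the filter, and the frequency-based ranker then selects "kitab". A word with no matching code is returned unchanged.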
Pages: 26