An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis

Cited by: 21
Authors
Mehmood, Khawar [1]
Essam, Daryl [1]
Shafi, Kamran [1]
Malik, Muhammad Kamran [2]
Affiliations
[1] Univ New South Wales Canberra, Sch Engn & Informat Technol SEIT, Canberra, ACT, Australia
[2] Univ Punjab, Coll Informat Technol PUCIT, Punjab Univ, Lahore, Pakistan
Keywords
Machine learning; Natural language processing; Pattern recognition; Sentiment analysis; Text normalization
DOI
10.1016/j.ipm.2020.102368
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text normalization is the task of transforming lexically variant words to their canonical forms. The importance of text normalization becomes apparent while developing natural language processing applications. This paper proposes a novel technique called Transliteration based Encoding for Roman Hindi/Urdu text Normalization (TERUN). TERUN utilizes the linguistic aspects of Roman Hindi/Urdu to transform lexically variant words to their canonical forms. It consists of three interlinked modules: a transliteration-based encoder, a filter module and a hash-code ranker. The encoder generates all possible hash-codes for a single Roman Hindi/Urdu word. The filter module discards the irrelevant codes, while the ranker orders the remaining hash-codes by relevance. The aim of this study is not only to normalize the text but also to examine the impact of normalization on text classification. Hence, baseline classification accuracies were computed on a dataset of 11,000 non-standardized Roman Hindi/Urdu sentiment analysis reviews using different machine learning algorithms. The dataset was then standardized using TERUN and other established phonetic algorithms, and the classification accuracies were recomputed. The cross-scheme comparison showed that TERUN outperformed all the phonetic algorithms and significantly reduced the error rate from the baseline. TERUN was then enhanced from a corpus-specific to a corpus-independent text normalization technique. To this end, a parallel corpus of 50,000 Urdu and Roman Hindi/Urdu words was manually tagged using a set of comprehensive annotation guidelines. Also, different phonetic algorithms and TERUN were intrinsically evaluated using a dataset of 20,000 lexically variant words. The results clearly showed the superiority of TERUN over well-known phonetic algorithms.
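The encode, filter, rank flow described in the abstract can be pictured with a small toy sketch. This is not the TERUN implementation: the character-to-code table `CHAR_CODES`, the lexicon `LEXICON`, its frequencies, and the function names are all hypothetical, chosen only to show how several candidate codes for one spelling might be generated, pruned against known codes, and ordered.

```python
from itertools import product

# Hypothetical table: vowels may be dropped or mapped to a vowel class.
# Illustrative only; it is NOT the mapping used by TERUN.
CHAR_CODES = {
    "a": ["", "A"],
    "e": ["", "E"],
    "i": ["", "E"],
    "y": ["", "E"],
    "o": ["", "O"],
    "u": ["", "O"],
}

# Hypothetical lexicon: code -> (canonical spelling, made-up frequency).
LEXICON = {
    "KTAB": ("kitab", 120),
    "KTB": ("kutub", 15),
}


def encode(word):
    """Encoder: generate every candidate code for one Roman word."""
    options = [CHAR_CODES.get(ch, [ch.upper()]) for ch in word.lower()]
    return {"".join(parts) for parts in product(*options)}


def filter_codes(codes):
    """Filter: keep only the codes that the lexicon knows about."""
    return [code for code in codes if code in LEXICON]


def rank_codes(codes):
    """Ranker: order surviving codes by lexicon frequency, highest first."""
    return sorted(codes, key=lambda code: -LEXICON[code][1])


def normalize(word):
    """Return the canonical form of the best-ranked code, else the word itself."""
    ranked = rank_codes(filter_codes(encode(word)))
    return LEXICON[ranked[0]][0] if ranked else word


if __name__ == "__main__":
    for variant in ["kitab", "kitaab", "ketab", "xyz"]:
        print(variant, "->", normalize(variant))
```

Under these assumptions the spelling variants "kitab", "kitaab" and "ketab" all resolve to the same canonical entry, while a word with no surviving code is returned unchanged. A frequency-based ranker is only one possible relevance criterion; the abstract does not say how the paper's ranker scores codes.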
Pages: 26