A clustering framework for lexical normalization of Roman Urdu

Cited by: 2
Authors
Khan, Abdul Rafae [1 ,2 ]
Karim, Asim [3 ]
Sajjad, Hassan [4 ]
Kamiran, Faisal [5 ]
Xu, Jia [1 ,2 ]
Affiliations
[1] Stevens Inst Technol, Hoboken, NJ 07030 USA
[2] CUNY, Comp Sci Dept, Grad Ctr, 365 5th Ave, New York, NY 10016 USA
[3] Lahore Univ Management Sci, Lahore 54792, Pakistan
[4] Hamad Bin Khalifa Univ, Qatar Comp Res Inst, Doha, Qatar
[5] Informat Technol Univ, Arfa Software Technol Pk,Ferozepur Rd, Lahore, Pakistan
Funding
National Science Foundation (USA);
Keywords
Text data mining; Similarity; Machine learning; Phonetic encoding; Text normalization; Model
DOI
10.1017/S1351324920000285
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic-based, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It contains a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over the respective alternative algorithms within our clustering framework for the lexical normalization of Roman Urdu.
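The pipeline outlined in the abstract (phonetic encoding, a weighted similarity function, and threshold-controlled medoid clustering) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_phonetic_key` is a Soundex-style stand-in and not the UrduPhone mapping, `similarity` combines only two features with made-up weights and omits contextual features, and `cluster` is a simplified threshold-seeded medoid loop rather than the authors' Lex-Var algorithm.

```python
from difflib import SequenceMatcher

# Toy consonant classes. Illustrative stand-in only, NOT the UrduPhone mapping.
_CLASSES = {
    "b": "1", "p": "1", "f": "1", "v": "1", "w": "1",
    "c": "2", "k": "2", "q": "2", "g": "2", "j": "2", "s": "2", "z": "2", "x": "2",
    "d": "3", "t": "3",
    "l": "4",
    "m": "5", "n": "5",
    "r": "6",
}


def toy_phonetic_key(word, length=6):
    """Soundex-style key: first letter plus consonant-class digits.

    Vowels and unmapped letters are skipped; directly adjacent letters of the
    same class are collapsed into one digit.
    """
    word = word.lower()
    if not word:
        return ""
    key = word[0]
    prev = _CLASSES.get(word[0], "")
    for ch in word[1:]:
        code = _CLASSES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    return key[:length]


def similarity(a, b, w_phonetic=0.5, w_string=0.5):
    """Weighted sum of a phonetic-match feature and a string-overlap feature.

    The weights are hypothetical defaults; contextual features are omitted.
    """
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phonetic * phonetic + w_string * string


def cluster(words, threshold=0.7, max_iter=10):
    """Threshold-seeded medoid clustering (a simplified sketch, not Lex-Var)."""
    # Seed: a word opens a new cluster if it is below the threshold
    # against every medoid chosen so far.
    medoids = []
    for w in words:
        if all(similarity(w, m) < threshold for m in medoids):
            medoids.append(w)

    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each word joins its most similar medoid.
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(medoids, key=lambda m: similarity(w, m))
            clusters[best].append(w)
        # Update step: the new medoid maximizes total in-cluster similarity.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
            if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [members for members in clusters.values() if members]


if __name__ == "__main__":
    variants = ["zindagi", "zindagee", "zindgi", "kitab", "kitaab", "ktab"]
    for group in cluster(variants):
        print(group)
```

In this sketch, raising `threshold` makes seeding stricter and so produces more, tighter clusters, which loosely mirrors the trade-off between the number of clusters and their similarity that the abstract attributes to the Lex-Var threshold.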
Pages: 93-123
Number of pages: 31
Related papers
50 in total
  • [31] Opinion Mining in Roman Urdu using Baseline Classifiers
    Sharf, Zareen
    Mansoor, Husnain Ali
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2018, 18 (09): : 156 - 164
  • [32] Performing Natural Language Processing on Roman Urdu Datasets
    Sharf, Zareen
    Rahman, Saif Ur
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2018, 18 (01): : 141 - 148
  • [33] Roman Urdu Sentiment Analysis Using Transfer Learning
    Li, Dun
    Ahmed, Kanwal
    Zheng, Zhiyun
    Mohsan, Syed Agha Hassnain
    Alsharif, Mohammed H.
    Hadjouni, Myriam
    Jamjoom, Mona M.
    Mostafa, Samih M.
    APPLIED SCIENCES-BASEL, 2022, 12 (20):
  • [34] Detecting Spam Product Reviews in Roman Urdu Script
    Hussain N.
    Mirza H.T.
    Iqbal F.
    Hussain I.
    Kaleem M.
    Computer Journal, 2021, 64 (03): : 432 - 450
  • [35] Transtech: development of a novel translator for Roman Urdu to English
    Masroor, Hafsa
    Saeed, Muhammad
    Feroz, Maryam
    Ahsan, Kamran
    Islam, Khawar
    HELIYON, 2019, 5 (05)
  • [36] An Unsupervised Approach for Sentiment Analysis on Social Media Short Text Classification in Roman Urdu
    Rana, Toqir A.
    Shahzadi, Kiran
    Rana, Tauseef
    Arshad, Ahsan
    Tubishat, Mohammad
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (02)
  • [37] Lexical Normalization Model for Noisy SMS Text
    Jose, Greety
    Raj, Nisha. S.
    2014 FIRST INTERNATIONAL CONFERENCE ON COMPUTATIONAL SYSTEMS AND COMMUNICATIONS (ICCSC), 2014, : 57 - 62
  • [38] TweetNorm: a benchmark for lexical normalization of Spanish tweets
    Alegria, Inaki
    Aranberri, Nora
    Comas, Pere R.
    Fresno, Victor
    Gamallo, Pablo
    Padro, Lluis
    San Vicente, Inaki
    Turmo, Jordi
    Zubiaga, Arkaitz
    LANGUAGE RESOURCES AND EVALUATION, 2015, 49 (04) : 883 - 905
  • [39] Lexical Intent Recognition in Urdu Queries Using Deep Neural Networks
    Shams, Sana
    Aslam, Muhammad
    Maria Martinez-Enriquez, Ana
    ADVANCES IN SOFT COMPUTING, MICAI 2019, 2019, 11835 : 39 - 50
  • [40] General perceptual contributions to lexical tone normalization
    Huang, Jingyuan
    Holt, Lori L.
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2009, 125 (06): : 3983 - 3994