A clustering framework for lexical normalization of Roman Urdu

Times cited: 2
Authors
Khan, Abdul Rafae [1 ,2 ]
Karim, Asim [3 ]
Sajjad, Hassan [4 ]
Kamiran, Faisal [5 ]
Xu, Jia [1 ,2 ]
Affiliations
[1] Stevens Inst Technol, Hoboken, NJ 07030 USA
[2] CUNY, Comp Sci Dept, Grad Ctr, 365 5th Ave, New York, NY 10016 USA
[3] Lahore Univ Management Sci, Lahore 54792, Pakistan
[4] Hamad Bin Khalifa Univ, Qatar Comp Res Inst, Doha, Qatar
[5] Informat Technol Univ, Arfa Software Technol Pk,Ferozepur Rd, Lahore, Pakistan
Funding
U.S. National Science Foundation
Keywords
Text data mining; Similarity; Machine learning; Phonetic encoding; Text normalization; Model
DOI
10.1017/S1351324920000285
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic-based, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It contains a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over the respective alternative algorithms in our clustering framework for the lexical normalization of Roman Urdu.
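The pipeline described in the abstract (phonetic encoding, a weighted similarity function, and threshold-based medoid clustering) can be illustrated with a rough sketch. Everything here is an assumption made for illustration: `toy_phonetic_key` is a hypothetical stand-in and not the UrduPhone mapping, the two equal feature weights and the 0.7 threshold are arbitrary, contextual features are omitted, and `cluster_variants` only imitates the general idea of a thresholded k-medoids variant rather than reproducing Lex-Var.

```python
from difflib import SequenceMatcher


def toy_phonetic_key(word: str) -> str:
    """Hypothetical phonetic key (NOT UrduPhone): keep the first letter,
    drop later vowels and 'y', and collapse repeated letters."""
    word = word.lower()
    key = word[:1]
    for ch in word[1:]:
        if ch in "aeiouy":
            continue
        if key[-1] != ch:
            key += ch
    return key


def similarity(a: str, b: str, w_phonetic: float = 0.5, w_string: float = 0.5) -> float:
    """Weighted sum of a phonetic-match feature and a string-overlap feature.
    The weights are assumed values, not learned ones."""
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phonetic * phonetic + w_string * string


def cluster_variants(words, threshold: float = 0.7, max_iter: int = 10):
    """Assign each word to its most similar medoid; a word whose best
    similarity falls below the threshold starts a new cluster. Medoids are
    then re-elected and the assignment repeats until they stop changing."""
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(medoids, key=lambda m: similarity(w, m), default=None)
            if best is None or similarity(w, best) < threshold:
                medoids.append(w)
                clusters[w] = []
                best = w
            clusters[best].append(w)
        # The new medoid of a cluster is the member most similar to the rest.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
            if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [sorted(members) for members in clusters.values() if members]


if __name__ == "__main__":
    print(cluster_variants(["kia", "kya", "kiya", "nahi", "nhi", "nahee"]))
```

In this sketch a higher threshold yields more, tighter clusters and a lower one merges more spellings, which mirrors the trade-off the abstract attributes to the similarity threshold.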
Pages: 93-123
Page count: 31