A clustering framework for lexical normalization of Roman Urdu

Cited by: 2
Authors
Khan, Abdul Rafae [1, 2]
Karim, Asim [3]
Sajjad, Hassan [4]
Kamiran, Faisal [5]
Xu, Jia [1, 2]
Affiliations
[1] Stevens Inst Technol, Hoboken, NJ 07030 USA
[2] CUNY, Comp Sci Dept, Grad Ctr, 365 5th Ave, New York, NY 10016 USA
[3] Lahore Univ Management Sci, Lahore 54792, Pakistan
[4] Hamad Bin Khalifa Univ, Qatar Comp Res Inst, Doha, Qatar
[5] Informat Technol Univ, Arfa Software Technol Pk,Ferozepur Rd, Lahore, Pakistan
Funding
U.S. National Science Foundation
Keywords
Text data mining; Similarity; Machine learning; Phonetic encoding; Text normalization; Model
DOI
10.1017/S1351324920000285
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It contains a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over their respective alternative algorithms in our clustering framework for the lexical normalization of Roman Urdu.
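The pipeline described in the abstract (phonetic key, string matching, weighted similarity, threshold-controlled medoid clustering) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_phonetic_key` is a hypothetical Soundex-like stand-in and does not reproduce the UrduPhone rules, the string feature uses `difflib.SequenceMatcher` from the standard library, the weights and the 0.7 threshold are arbitrary, contextual features are omitted, and `cluster` is a simplified k-medoids-style loop rather than the published Lex-Var algorithm.

```python
from difflib import SequenceMatcher


def toy_phonetic_key(word: str) -> str:
    """Hypothetical phonetic key (NOT UrduPhone): keep the first letter,
    drop later vowels and h/w/y, skip a consonant equal to the last kept one."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiouyhw":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str, w_phon: float = 0.5, w_str: float = 0.5) -> float:
    """Weighted sum of a phonetic feature and a string feature (assumed weights)."""
    phon = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phon * phon + w_str * string


def cluster(words, threshold: float = 0.7, max_iter: int = 10):
    """Simplified medoid clustering: a word joins its most similar medoid,
    or opens a new cluster when that similarity is below the threshold."""
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(clusters, key=lambda m: similarity(w, m), default=None)
            if best is None or similarity(w, best) < threshold:
                best = w  # no medoid is close enough: w starts its own cluster
                clusters[best] = []
            clusters[best].append(w)
        # Re-pick each medoid as the member most similar to the rest of its cluster.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
            if members
        ]
        if set(new_medoids) == set(medoids):
            break  # medoids are stable, so the assignment has converged
        medoids = new_medoids
    return [sorted(members) for members in clusters.values() if members]


if __name__ == "__main__":
    # Common spelling variants of two Roman Urdu words, used as toy input.
    print(cluster(["kya", "kia", "kiya", "nahi", "nahin", "nai"]))
```

In this sketch, raising the threshold produces more, tighter clusters and lowering it produces fewer, looser ones, which is one plausible reading of how a similarity threshold can trade off cluster count against within-cluster similarity.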
Pages: 93-123
Page count: 31