A clustering framework for lexical normalization of Roman Urdu

Cited by: 2

Authors:

Khan, Abdul Rafae [1,2]

Karim, Asim [3]

Sajjad, Hassan [4]

Kamiran, Faisal [5]

Xu, Jia [1,2]

Affiliations:

[1] Stevens Inst Technol, Hoboken, NJ 07030 USA

[2] CUNY, Comp Sci Dept, Grad Ctr, 365 5th Ave, New York, NY 10016 USA

[3] Lahore Univ Management Sci, Lahore 54792, Pakistan

[4] Hamad Bin Khalifa Univ, Qatar Comp Res Inst, Doha, Qatar

[5] Informat Technol Univ, Arfa Software Technol Pk, Ferozepur Rd, Lahore, Pakistan

Source:

NATURAL LANGUAGE ENGINEERING | 2022 / Vol. 28 / Issue 01

Funding:

National Science Foundation (USA);

Keywords:

Text data mining; Similarity; Machine learning; Phonetic encoding; TEXT NORMALIZATION; MODEL;

DOI:

10.1017/S1351324920000285

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It uses a similarity threshold to balance the number of clusters against their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over their respective alternative algorithms in our clustering framework for the lexical normalization of Roman Urdu.
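
The pipeline described in the abstract (a phonetic key, a string-similarity feature, a weighted similarity function, and a threshold-controlled k-medoids-style clustering) can be illustrated with a minimal Python sketch. This is not the authors' UrduPhone or Lex-Var: the function names `phonetic_key`, `similarity`, and `cluster_variants`, the vowel-dropping key, the equal weights, and the threshold value are all assumptions made for illustration, and the contextual features mentioned in the abstract are omitted.

```python
from difflib import SequenceMatcher


def phonetic_key(word: str) -> str:
    """Toy phonetic key (an assumption, NOT UrduPhone): keep the first
    letter, drop later vowels, and collapse repeated consonants."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiou":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str, w_phon: float = 0.5, w_str: float = 0.5) -> float:
    """Weighted sum of a phonetic feature and a string feature.
    The weights are illustrative, not learned."""
    phon = SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()
    string = SequenceMatcher(None, a, b).ratio()
    return w_phon * phon + w_str * string


def assign(words, medoids, threshold):
    """Attach each word to its most similar medoid; a word whose best
    similarity falls below the threshold starts a new cluster."""
    medoids = list(medoids)
    clusters = {m: [m] for m in medoids}
    for w in words:
        if w in clusters:  # already a medoid
            continue
        best = max(medoids, key=lambda m: similarity(w, m), default=None)
        if best is not None and similarity(w, best) >= threshold:
            clusters[best].append(w)
        else:
            medoids.append(w)
            clusters[w] = [w]
    return clusters


def pick_medoid(members):
    """Medoid = member with the highest total similarity to its cluster."""
    return max(members, key=lambda c: sum(similarity(c, o) for o in members))


def cluster_variants(words, threshold=0.6, max_iter=10):
    """Alternate assignment and medoid update until the medoids stabilize."""
    words = list(dict.fromkeys(words))  # de-duplicate, keep order
    medoids, clusters = [], {}
    for _ in range(max_iter):
        clusters = assign(words, medoids, threshold)
        new_medoids = [pick_medoid(ms) for ms in clusters.values()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(sorted(ms) for ms in clusters.values())


if __name__ == "__main__":
    sample = ["kya", "kia", "kiya", "nahi", "nahin", "nhi"]
    for group in cluster_variants(sample):
        print(group)
```

In this sketch the threshold plays the role the abstract assigns to it: raising it splits spelling variants into more, tighter clusters, while lowering it merges them into fewer, looser ones.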

Pages: 93-123

Page count: 31
