An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi

Cited by: 0

Authors:

Shekhar, Shashi [1]

Sharma, Dilip Kumar [1]

Beg, M. M. Sufyan [2]

Affiliations:

[1] GLA Univ, Dept Comp Engn & Applicat, Mathura, India

[2] Aligarh Muslim Univ, Dept Comp Engn, Aligarh, Uttar Pradesh, India

Source:

COMPUTACION Y SISTEMAS | 2020 / Vol. 24 / No. 04

Keywords:

Language identification; transliteration; character embedding; word embedding; NLP; machine learning

DOI:

10.13053/CyS-24-4-3151

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

This paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at the word level using a BLSTM neural model. Transliteration is one of the important and relatively less mature areas of Natural Language Processing. During transliteration, issues such as language identification, script specification, and missing sounds arise in code-mixed data. Social media platforms are now widely used by people to express their opinions or interests. The language used on social media is often code-mixed text, i.e., a mix of two or more languages, and in code-mixed data one language is frequently written in the script of another. To process such text, identifying the language of each word is therefore important. The major contribution of this work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and Skip-gram models for language identification in code-mixed data. Popular word embedding features are used to represent each word. Much research has recently been done on language identification, but word-level language identification in a transliterated environment remains an open research issue for code-mixed data. We have implemented a deep learning model based on BLSTM that predicts the language of origin of each word in a sequence based on the specific words that precede it in the sequence. We also use multichannel neural networks combining CNN and BLSTM for word-level language identification of code-mixed data in which English and Roman-transliterated Hindi are used, and combine this with cBoW and Skip-gram for evaluation.
The BLSTM context capture module of the proposed system gives better accuracy with the word embedding model than with character embedding, as evaluated on our two test sets. The problem is modeled jointly within the deep-learning architecture. We present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.
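
As an illustration of the code-mixed index mentioned in the abstract, the sketch below computes a per-utterance index using a commonly cited formulation, 100 × (1 − max_i w_i / (n − u)), where n is the token count, u is the number of language-independent tokens, and w_i is the token count for language i. The abstract does not state the formula, so this formulation, the function name `code_mixing_index`, and the tag labels `hi`, `en`, and `univ` are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def code_mixing_index(tags, neutral=frozenset({"univ"})):
    """Return an assumed code-mixing index (0 <= value < 100) for one utterance.

    tags:    per-word language labels, e.g. "hi" or "en" (hypothetical labels).
    neutral: labels treated as language-independent (hypothetical label "univ").
    """
    # Count only tokens that carry a language label.
    counts = Counter(t for t in tags if t not in neutral)
    lang_tokens = sum(counts.values())  # this is n - u in the formula
    if lang_tokens == 0:
        # No language-bearing tokens, so there is nothing to mix.
        return 0.0
    dominant = max(counts.values())  # tokens in the most frequent language
    return 100.0 * (1.0 - dominant / lang_tokens)


# Hypothetical word-level tags for a six-token Hindi-English message.
example_tags = ["hi", "hi", "en", "hi", "univ", "en"]
print(code_mixing_index(example_tags))
```

A monolingual utterance scores 0 under this formulation, and the score grows as the share of tokens outside the dominant language increases, which is why such an index can serve as a rough measure of how hard word-level identification is for a given corpus.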

Citation

Pages: 1415-1427

Number of pages: 13
