An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi

Authors
Shekhar, Shashi [1]
Sharma, Dilip Kumar [1]
Beg, M. M. Sufyan [2]
Affiliations
[1] GLA Univ, Dept Comp Engn & Applicat, Mathura, India
[2] Aligarh Muslim Univ, Dept Comp Engn, Aligarh, Uttar Pradesh, India
Source
COMPUTACION Y SISTEMAS | 2020 / Vol. 24 / No. 04
Keywords
Language identification; transliteration; character embedding; word embedding; NLP; machine learning
DOI
10.13053/CyS-24-4-3151
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
This paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at the word level using a BLSTM neural model. Transliteration is one of the important and relatively less mature areas of Natural Language Processing. During transliteration, issues such as language identification, script specification, and missing sounds arise in code-mixed data. Social media platforms are now widely used by people to express their opinions or interests. The language used on social media is often code-mixed text, i.e., a mixture of two or more languages. In code-mixed data, one language is frequently written in the script of another, so identifying the language of each word is an important step in processing such text. The major contribution of this work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms: Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and Skip-gram models for language identification in code-mixed data. Popular word embedding features were used to represent each word. Much research has recently been done on language identification, but word-level language identification in the transliterated environment remains an open research issue for code-mixed data. We have implemented a BLSTM-based deep learning model that predicts the language of origin of a word from the specific words that precede it in the sequence. Multichannel neural networks combining CNN and BLSTM are used for word-level language identification of code-mixed data in which English and Roman-transliterated Hindi are used, and these are combined with cBoW and Skip-gram for evaluation. The BLSTM context capture module of the proposed system gives better accuracy with the word embedding model than with character embedding, as evaluated on our two test sets. The problem is modeled jointly with the deep learning design. We present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.
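The abstract mentions a "code mixed index" without defining it. As a rough illustration only, the sketch below assumes this refers to the utterance-level Code-Mixing Index (CMI) of Das and Gambäck, CMI = 100 * (1 - max(w_i) / (n - u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the number of tokens in language i. Whether the paper uses exactly this variant is an assumption; the function name, the tag labels ("hi", "en", "univ"), and the sample utterance are hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, neutral_tags=("univ",)):
    """Utterance-level CMI for a list of per-token language tags.

    Returns 0.0 for monolingual (or language-free) utterances and grows
    as the share of the dominant language shrinks.
    """
    n = len(tags)
    # Count only tokens that carry a language label (assumed tag scheme).
    lang_counts = Counter(t for t in tags if t not in neutral_tags)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:  # no language-specific tokens at all
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))


# Hypothetical word-level tags for "yaar this movie bahut acchi thi !"
tags = ["hi", "en", "en", "hi", "hi", "hi", "univ"]
print(round(code_mixing_index(tags), 2))
```

In a word-level language identification pipeline, tags of this kind would come from the tagger's predictions or from gold annotations, so the index can be used to compare how heavily mixed different corpora are.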
Pages: 1415 - 1427
Page count: 13