An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi

Authors
Shekhar, Shashi [1]
Sharma, Dilip Kumar [1]
Beg, M. M. Sufyan [2]
Affiliations
[1] GLA Univ, Dept Comp Engn & Applicat, Mathura, India
[2] Aligarh Muslim Univ, Dept Comp Engn, Aligarh, Uttar Pradesh, India
Source
COMPUTACION Y SISTEMAS | 2020 / Vol. 24 / No. 04
Keywords
Language identification; transliteration; character embedding; word embedding; NLP; machine learning
DOI
10.13053/CyS-24-4-3151
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The paper describes the application of the code-mixing index to Indian social media texts and compares the complexity of identifying language at the word level using a BLSTM neural model. Transliteration is one of the important and relatively less mature areas of Natural Language Processing. During transliteration, issues such as language identification, script specification, and missing sounds arise in code-mixed data. Social media platforms are now widely used by people to express their opinions or interests. The language used on social media today is often code-mixed text, i.e., a mix of two or more languages, in which one language is written in the script of another. To process such text, identifying the language of each word is important. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and Skip-gram models for language identification in code-mixed data. Popular word embedding features were used to represent each word. Much research has recently been done on language identification, but word-level language identification in a transliterated environment remains an open research issue for code-mixed data. We have implemented a BLSTM-based deep learning model that predicts the language of origin of each word in a sequence based on the specific words that precede it. We also use multichannel neural networks combining CNN and BLSTM for word-level language identification of code-mixed data in which English and Roman-transliterated Hindi are used, combined with cBoW and Skip-gram for evaluation. The BLSTM context capture module of the proposed system gives better accuracy with the word embedding model than with character embedding, evaluated on our two test sets. The problem is modeled jointly within the deep learning design. We present an in-depth empirical analysis of the proposed methodology against standard approaches for language identification.
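As an illustration of the code-mixing index mentioned in the abstract, the sketch below computes one commonly used utterance-level formulation: CMI = 100 × (1 − max_i(w_i) / (n − u)), where w_i is the number of tokens in language i, n is the total token count, and u is the number of language-independent tokens. This is an assumption about the measure; the abstract does not state which variant the paper uses. The function name `code_mixing_index`, the tag names `hi`, `en`, and `univ`, and the example tags are hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, neutral=("univ",)):
    """Utterance-level code-mixing index on a 0-100 scale.

    tags    -- per-token language labels, e.g. ["hi", "en", "univ"]
    neutral -- labels treated as language-independent (assumed tag names)
    """
    # Count only tokens that carry a language label.
    counts = Counter(t for t in tags if t not in neutral)
    lang_tokens = sum(counts.values())  # this is n - u
    if lang_tokens == 0:
        # No language-specific tokens: the index is defined as 0.
        return 0.0
    dominant = max(counts.values())  # tokens in the most frequent language
    return 100.0 * (1.0 - dominant / lang_tokens)


# Hypothetical word-level tags for a five-token Hindi-English message.
example_tags = ["hi", "hi", "en", "univ", "hi"]
cmi = code_mixing_index(example_tags)
```

A monolingual utterance scores 0, and the score rises as the tokens are spread more evenly across languages, which is why such an index can serve as a rough measure of how difficult word-level identification is likely to be.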
Pages: 1415-1427
Page count: 13