An effective cybernated word embedding system for analysis and language identification in code-mixed social media text

Cited by: 4
Authors
Shekhar, Shashi [1 ]
Sharma, Dilip Kumar [1 ]
Beg, M. M. Sufyan [2 ]
Affiliations
[1] GLA Univ, Dept Comp Engn & Applicat, Mathura 281406, India
[2] Aligarh Muslim Univ, Dept Comp Engn, Aligarh 202002, Uttar Pradesh, India
Keywords
Language identification; transliteration; character embedding; word embedding; Natural Language Processing; cBoW; skip-gram
DOI
10.3233/KES-190409
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Social media users today commonly write code-mixed text, i.e., text that mixes two or more languages. This paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at the word level using a Bi-directional Long Short-Term Memory (BiLSTM) model. Social media platforms are now widely used by people to express their opinions and interests. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and skip-gram models that predicts the language of origin of each word in a sequence from the specific words that precede it. The context capture module of the system achieves better accuracy with the word embedding model than with character embedding.
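The abstract does not define the "code mixed index". Assuming it refers to the Code-Mixing Index (CMI) commonly attributed to Das and Gambäck, a minimal sketch of the per-utterance measure might look like the following. The tag names (`hi`, `en`, `univ`) and the function name are hypothetical and are not taken from the paper.

```python
from collections import Counter

def code_mixing_index(tags, neutral="univ"):
    """Per-utterance CMI = 100 * (1 - max_lang_count / (n - u)).

    tags    : one language tag per token (hypothetical tag set)
    neutral : tag for language-independent tokens (assumption)
    Returns 0.0 when every token is language-independent.
    """
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != neutral)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))

# Illustrative (made-up) Hindi-English utterance: 3 hi, 2 en, 1 neutral token.
example_tags = ["hi", "hi", "en", "univ", "en", "hi"]
example_cmi = code_mixing_index(example_tags)
print(round(example_cmi, 2))
```

Under this definition a monolingual utterance scores 0, and the score rises as the dominant language accounts for a smaller share of the language-tagged tokens.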
Pages: 167-179
Page count: 13
Related papers
  • [21] Experimenting Language Identification for Sentiment Analysis of English Punjabi Code Mixed Social Media Text
    Bansal, Neetika
    Goyal, Vishal
    Rani, Simpel
    INTERNATIONAL JOURNAL OF E-ADOPTION, 2020, 12 (01) : 52 - 62
  • [22] DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
    Chakravarthi, Bharathi Raja
    Priyadharshini, Ruba
    Muralidaran, Vigneshwaran
    Jose, Navya
    Suryawanshi, Shardul
    Sherly, Elizabeth
    McCrae, John P.
    LANGUAGE RESOURCES AND EVALUATION, 2022, 56 (03) : 765 - 806
  • [24] Transformer Based Language Identification for Malayalam-English Code-Mixed Text
    Thara, S.
    Poornachandran, Prabaharan
    IEEE ACCESS, 2021, 9 : 118837 - 118850
  • [25] Word Level Language Identification of Code Mixing Text in Social Media using NLP
    Shanmugalingam, Kasthuri
    Sumathipala, Sagara
    Premachandra, Chinthaka
    2018 3RD INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY RESEARCH (ICITR), 2018,
  • [26] A Comparison Study of Word Embedding for Detecting Named Entities of Code-Mixed Data in Indian Language
    Sravani, Lolla
    Reddy, Atla Sowmya
    Thara, S.
    2018 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2018, : 2375 - 2381
  • [27] Language Identification and Analysis of Code-Switched Social Media Text
    Mave, Deepthi
    Maharjan, Suraj
    Solorio, Thamar
    COMPUTATIONAL APPROACHES TO LINGUISTIC CODE-SWITCHING, 2018, : 51 - 61
  • [28] Distributional Word Representations for Code-mixed Text in Moroccan Darija
    Aghzal, Mohamed
    Mourhir, Asmaa
    AI IN COMPUTATIONAL LINGUISTICS, 2021, 189 : 266 - 273
  • [30] CMHE-AN: Code mixed hybrid embedding based attention network for aggression identification in hindi english code-mixed text
    Mundra, Shikha
    Mittal, Namita
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (08) : 11337 - 11364