An effective cybernated word embedding system for analysis and language identification in code-mixed social media text

Cited by: 4
Authors
Shekhar, Shashi [1]
Sharma, Dilip Kumar [1]
Beg, M. M. Sufyan [2]
Affiliations
[1] GLA Univ, Dept Comp Engn & Applicat, Mathura 281406, India
[2] Aligarh Muslim Univ, Dept Comp Engn, Aligarh 202002, Uttar Pradesh, India
Keywords
Language identification; transliteration; character embedding; word embedding; Natural Language Processing; cBoW; skip-gram
DOI
10.3233/KES-190409
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Users on social media now commonly write code-mixed text, i.e., text that mixes two or more languages. This paper describes the application of the code-mixed index to Indian social media texts and compares the complexity of identifying language at the word level using a Bi-directional Long Short-Term Memory model. Social media platforms are now widely used by people to express their opinions and interests. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms, namely Facebook, Twitter, and WhatsApp. We recommend a deep learning framework based on the cBoW and skip-gram models that predicts the language of origin of each word in a sequence from the specific words that precede it in the sequence. The context capture module of the system gives better accuracy for the word embedding model than for character embedding.
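As a rough illustration of the "code mixed index" mentioned in the abstract, the sketch below assumes it corresponds to the Code-Mixing Index (CMI) commonly used in the code-mixing literature: CMI = 100 * (1 - max(w_i) / (n - u)), where n is the number of tokens in an utterance, u is the number of language-independent tokens, and max(w_i) is the token count of the dominant language. The abstract does not give the paper's exact formulation, so the function name, the tag labels ("hi", "en", "univ"), and the sample tags are all hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, neutral="univ"):
    """Return the Code-Mixing Index (0 to 100) for one utterance.

    `tags` holds one word-level language tag per token. Tokens tagged
    `neutral` (an assumed label for language-independent tokens such as
    punctuation or named entities) are excluded from the language counts.
    """
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != neutral)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:
        # No language-tagged tokens at all: defined as no mixing.
        return 0.0
    dominant = max(lang_counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


# Hypothetical tags for a six-token Hindi-English utterance.
sample_tags = ["hi", "hi", "en", "en", "hi", "univ"]
sample_cmi = code_mixing_index(sample_tags)
print(f"CMI = {sample_cmi:.1f}")
```

A monolingual utterance scores 0 under this definition, and the score rises as the non-dominant languages take a larger share of the language-tagged tokens.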
Pages: 167-179
Number of pages: 13
Related papers
50 in total
  • [1] An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
    Shekhar, Shashi
    Sharma, Dilip Kumar
    Beg, M. M. Sufyan
    COMPUTACION Y SISTEMAS, 2020, 24 (04): 1415-1427
  • [2] Character Embedding for Language Identification in Hindi-English Code-mixed Social Media Text
    Veena, P. V.
    Kumar, M. Anand
    Soman, K. P.
    COMPUTACION Y SISTEMAS, 2018, 22 (01): 65-74
  • [3] Word Level Language Identification system for Konkani-English Code-Mixed Social Media Text (CMST)
    Phadte, Akshata
    Wagh, Ramrao
    COMPUTE'17: PROCEEDINGS OF THE 10TH ANNUAL ACM INDIA COMPUTE CONFERENCE, 2017: 103-107
  • [4] A Language Identification System for Code-Mixed English-Manipuri Social Media Text
    Lamabam, Priyadarshini
    Chakma, Kunal
    PROCEEDINGS OF 2ND IEEE INTERNATIONAL CONFERENCE ON ENGINEERING & TECHNOLOGY ICETECH-2016, 2016: 79-83
  • [5] SwitchNet: Learning to switch for word-level language identification in code-mixed social media text
    Sarma, Neelakshi
    Sanasam Singh, Ranbir
    Goswami, Diganta
    NATURAL LANGUAGE ENGINEERING, 2022, 28 (03): 337-359
  • [6] Automatic Language Identification system for code-mixed English-Kannada Social Media Text
    Lakshmi, Sowmya B. S.
    Shambhavi, B. R.
    2017 2ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL SYSTEMS AND INFORMATION TECHNOLOGY FOR SUSTAINABLE SOLUTION (CSITSS-2017), 2017: 214-218
  • [7] Language identification framework in code-mixed social media text based on quantum LSTM - the word belongs to which language?
    Shekhar, Shashi
    Sharma, Dilip Kumar
    Beg, M. M. Sufyan
    MODERN PHYSICS LETTERS B, 2020, 34 (06)
  • [8] Word Level Language Identification in Assamese-Bengali-Hindi-English Code-Mixed Social Media Text
    Sarma, Neelakshi
    Singh, Sanasam Ranbir
    Goswami, Diganta
    2018 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2018: 261-266
  • [9] Text Normalization in Code-Mixed Social Media Text
    Dutta, Sukanya
    Saha, Tista
    Banerjee, Somnath
    Naskar, Sudip Kumar
    2015 IEEE 2ND INTERNATIONAL CONFERENCE ON RECENT TRENDS IN INFORMATION SYSTEMS (RETIS), 2015: 378-382
  • [10] Detecting Stance in Kannada Social Media Code-Mixed Text using Sentence Embedding
    Skanda, V. Srinidhi
    Kumar, M. Anand
    Soman, K. P.
    2017 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2017: 964-969