Word Level Language Identification of Code Mixing Text in Social Media using NLP

Cited by: 0
Authors
Shanmugalingam, Kasthuri [1]
Sumathipala, Sagara [1]
Premachandra, Chinthaka [2]
Affiliations
[1] Univ Moratuwa, Dept Computat Math, Moratuwa, Sri Lanka
[2] Shibaura Inst Technol, Tokyo, Japan
Keywords
Code-mixing; NLP; machine learning; language identification;
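DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract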
Understanding social media content has been a primary research topic since the dawn of social networking. Of particular interest is contextual understanding of noisy text, which is characterized by a high percentage of spelling mistakes, creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing this data therefore demands a more complex system than traditional natural language processors. In addition, people readily mix two or more languages to express their thoughts on social media. Automatic language identification at the word level thus becomes a necessary step in analyzing noisy social media content, and it would help with the automated analysis of content generated on social media. This study uses Tamil-English code-mixed data from popular social media posts and comments and assigns word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) techniques. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Several machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among these, the SVM classifier achieved the highest accuracy, 89.46%.
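The feature set named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy English word list, and the exact definition of each feature (for example, treating "double consonant" as two identical adjacent consonant letters, and the Tamil feature as the presence of characters from the Unicode Tamil block) are assumptions made only for illustration. The resulting per-word feature records could then be passed to any of the classifiers listed in the abstract.

```python
from collections import Counter

# Hypothetical toy English word list; the paper's actual dictionaries are not given here.
ENGLISH_WORDS = {"the", "is", "movie", "super", "very", "good"}
VOWELS = set("aeiou")


def has_tamil_script(token: str) -> bool:
    """True if any character lies in the Unicode Tamil block (U+0B80 to U+0BFF)."""
    return any("\u0b80" <= ch <= "\u0bff" for ch in token)


def has_double_consonant(token: str) -> bool:
    """True if two identical ASCII consonant letters are adjacent, as in 'amma'."""
    t = token.lower()
    return any(
        a == b and a.isascii() and a.isalpha() and a not in VOWELS
        for a, b in zip(t, t[1:])
    )


def extract_features(tokens):
    """Build one feature record per token (assumed feature definitions)."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    return [
        {
            "token": t,
            "tamil_script": has_tamil_script(t),
            "in_english_dict": t.lower() in ENGLISH_WORDS,
            "double_consonant": has_double_consonant(t),
            # Relative frequency of the token within this message.
            "term_frequency": counts[t.lower()] / total,
        }
        for t in tokens
    ]


# Example: a short romanized Tamil-English message split on whitespace.
features = extract_features("amma padam super movie".split())
```

Each record is a plain dictionary, so it can be converted to a numeric vector by whatever vectorizer the chosen classifier expects.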
Pages: 5