Word Level Language Identification of Code Mixing Text in Social Media using NLP

Authors
Shanmugalingam, Kasthuri [1]
Sumathipala, Sagara [1]
Premachandra, Chinthaka [2]
Affiliations
[1] Univ Moratuwa, Dept Computat Math, Moratuwa, Sri Lanka
[2] Shibaura Inst Technol, Tokyo, Japan
Keywords
Code-mixing; NLP; machine learning; language identification
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Understanding social media content has been a primary research topic since the dawn of social networking. Contextual understanding of noisy text is especially challenging, because such text is characterized by a high percentage of spelling mistakes along with creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing this data therefore demands a more complex system than traditional natural language processors. People also readily mix two or more languages to express their thoughts on social media, so automatic language identification at the word level becomes a necessary part of analyzing noisy social media content and would help automate the analysis of content generated there. This study uses Tamil-English code-mixed data from popular social media posts and comments and assigns word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) techniques. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman scripts, dictionaries, double consonants, and term frequency. Several machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among them, the SVM classifier obtained the highest accuracy, 89.46%.
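The feature types named in the abstract can be illustrated with a minimal sketch of per-word feature extraction. This is not the authors' implementation: the function name `word_features`, the toy lexicon `ENGLISH_WORDS`, the sample sentence, and the exact definition of each feature are assumptions made for illustration. In particular, the "Tamil Unicode characters" feature is interpreted here as a check against the Unicode Tamil block (U+0B80 to U+0BFF), and "double consonant" as two identical adjacent consonant letters in the Romanized word. Classifier training is deliberately left out.

```python
import re
from collections import Counter

# Assumption: the Tamil script feature is a check against the Unicode Tamil block.
TAMIL_BLOCK = re.compile(r"[\u0B80-\u0BFF]")
# Assumption: "double consonant" means two identical adjacent consonant letters.
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")

# Hypothetical toy lexicon standing in for a real English dictionary.
ENGLISH_WORDS = {"the", "movie", "super", "is", "good"}


def word_features(word, term_counts):
    """Return a feature dict for one token (illustrative feature set only)."""
    lower = word.lower()
    return {
        "has_tamil_char": bool(TAMIL_BLOCK.search(word)),
        "in_english_dict": lower in ENGLISH_WORDS,
        "has_double_consonant": bool(DOUBLE_CONSONANT.search(lower)),
        # Counter returns 0 for tokens it has not seen.
        "term_frequency": term_counts[lower],
    }


# Made-up code-mixed sentence in Romanized Tamil and English.
tokens = "intha movie romba nalla irukku super".split()
counts = Counter(t.lower() for t in tokens)
features = [word_features(t, counts) for t in tokens]

for token, feats in zip(tokens, features):
    print(token, feats)
```

Feature dicts of this shape could then be vectorized and passed to any of the classifiers the abstract lists; which encoding and hyperparameters the study used is not stated in this record.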
Pages: 5