Word Level Language Identification of Code Mixing Text in Social Media using NLP

Cited by: 0
Authors
Shanmugalingam, Kasthuri [1]
Sumathipala, Sagara [1]
Premachandra, Chinthaka [2]
Affiliations
[1] Univ Moratuwa, Dept Computat Math, Moratuwa, Sri Lanka
[2] Shibaura Inst Technol, Tokyo, Japan
Keywords
Code-mixing; NLP; machine learning; language identification
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Understanding social media content has been a primary research topic since the dawn of social networking. Of particular interest is the contextual understanding of noisy text, which is characterized by a high percentage of spelling mistakes, creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing such data therefore demands a more complex system than traditional natural language processors. In addition, people readily mix two or more languages to express their thoughts on social media. Automatic language identification at the word level has thus become a necessary part of analyzing noisy social media content, and it would help with the automated analysis of content generated on social media. This study uses Tamil-English code-mixed data from popular social media posts and comments and provides word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) technologies. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Several machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among them, the SVM classifier obtained the highest accuracy, 89.46%.
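The feature types named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy word lists, the function names, and the exact feature definitions (for instance, reading "double consonant" as a repeated consonant letter, and the Unicode feature as a check against the Tamil block U+0B80 to U+0BFF) are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical toy word lists; the dictionaries used in the paper are not
# given in this record.
ENGLISH_WORDS = {"i", "am", "going", "to", "the", "movie", "super"}
TAMIL_ROMAN_WORDS = {"naan", "padam", "romba", "nalla", "irukku"}

# Assumption: "double consonant" means the same consonant letter twice in a
# row, e.g. "kk" or "tt".
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def in_tamil_block(word):
    """Return True if any character lies in the Tamil Unicode block."""
    return any("\u0b80" <= ch <= "\u0bff" for ch in word)


def word_features(word, term_counts):
    """Build a feature dict for one token (illustrative feature set only)."""
    w = word.lower()
    return {
        "tamil_script": in_tamil_block(w),
        "in_english_dict": w in ENGLISH_WORDS,
        "in_tamil_dict": w in TAMIL_ROMAN_WORDS,
        "double_consonant": bool(DOUBLE_CONSONANT.search(w)),
        "term_frequency": term_counts[w],
    }


def featurize(sentence):
    """Split on whitespace and return (token, features) pairs."""
    tokens = sentence.split()
    counts = Counter(t.lower() for t in tokens)
    return [(t, word_features(t, counts)) for t in tokens]


if __name__ == "__main__":
    for token, feats in featurize("naan padam paakka going"):
        print(token, feats)
```

Feature dictionaries of this kind could then be vectorized and passed to any of the classifiers listed in the abstract, such as an SVM; that training step is omitted here because the record does not describe the data or the settings used.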
Pages: 5
Related papers
50 items in total
  • [41] Social media text analytics of Malayalam-English code-mixed using deep learning
    Thara, S.
    Poornachandran, Prabaharan
    JOURNAL OF BIG DATA, 2022, 9 (01)
  • [42] High Level Event Identification in Social Media
    Dashdorj, Zolzaya
    Tsogtbaatar, Battushig
    Tumurchudur, Altangerel
    Altangerel, Erdenebaatar
    PROCEEDINGS OF 2016 12TH INTERNATIONAL CONFERENCE ON SEMANTICS, KNOWLEDGE AND GRIDS (SKG), 2016, : 121 - 125
  • [43] Word Level Script Identification of Text in Low Resolution Images of Display Boards Using Wavelet Features
    Angadi, S. A.
    Kodabagi, M. M.
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, 2013, 174 : 209 - 220
  • [44] Analysis of Part of Speech Tags in Language Identification of Code-Mixed Text
    Ansari, Mohd Zeeshan
    Khan, Shazia
    Amani, Tamsil
    Hamid, Aman
    Rizvi, Syed
    ADVANCES IN COMPUTING AND INTELLIGENT SYSTEMS, ICACM 2019, 2020, : 417 - 425
  • [45] Media Bias, the Social Sciences, and NLP: Automating Frame Analyses to Identify Bias by Word Choice and Labeling
    Hamborg, Felix
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): STUDENT RESEARCH WORKSHOP, 2020, : 79 - 87
  • [46] Social Media Rumor Refuter Feature Analysis and Crowd Identification Based on XGBoost and NLP
    Li, Zongmin
    Zhang, Qi
    Wang, Yuhong
    Wang, Shihang
    APPLIED SCIENCES-BASEL, 2020, 10 (14):
  • [47] Language Identification for Social Media: Short Messages and Transliteration
    Cardoso, Pedro Miguel Dias
    Roy, Anindya
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16 COMPANION), 2016, : 611 - 614
  • [48] CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts
    Lakshmaiah, Shashirekha Hosahalli
    Balouchzahi, Fazlourrahman
    Anusha, Mudoor Devadas
    Sidorov, Grigori
    ACTA POLYTECHNICA HUNGARICA, 2022, 19 (10) : 123 - 141
  • [49] Social, economic, and demographic factors drive the emergence of Hinglish code-mixing on social media
    Sengupta, Ayan
    Das, Soham
    Akhtar, Md. Shad
    Chakraborty, Tanmoy
    HUMANITIES & SOCIAL SCIENCES COMMUNICATIONS, 2024, 11 (01):
  • [50] Bilingual Code-Mixing in Indian Social Media Texts for Hindi and English
    Kumar, Rajesh
    Singh, Pardeep
    ADVANCED INFORMATICS FOR COMPUTING RESEARCH, ICAICR 2017, 2017, 712 : 121 - 129