Word Level Language Identification of Code Mixing Text in Social Media using NLP

Cited by: 0
Authors
Shanmugalingam, Kasthuri [1 ]
Sumathipala, Sagara [1 ]
Premachandra, Chinthaka [2 ]
Affiliations
[1] Univ Moratuwa, Dept Computat Math, Moratuwa, Sri Lanka
[2] Shibaura Inst Technol, Tokyo, Japan
Keywords
Code-mixing; NLP; machine learning; language identification
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Understanding social media content has been a primary research topic since the dawn of social networking. A particular challenge is the contextual understanding of noisy text, which is characterized by a high percentage of spelling mistakes along with creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing such data therefore demands a more complex system than traditional natural language processors. In addition, people readily mix two or more languages to express their thoughts on social media. Automatic language identification at the word level thus becomes a necessary part of analyzing noisy social media content, and it would help with the automated analysis of content generated on social media. This study uses Tamil-English code-mixed data from popular social media posts and comments and assigns word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) techniques. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Different machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among these, the SVM classifier achieved the highest accuracy, 89.46%.
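The feature types named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word lists, the regular expressions, the function names, and the reading of "Tamil Unicode characters" as code points in the Tamil block (U+0B80 to U+0BFF) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries; the paper's actual word lists are not given here.
ENGLISH_WORDS = {"movie", "super", "the", "is", "good"}
TAMIL_ROMAN_WORDS = {"padam", "romba", "nalla", "irukku"}

# Assumption: a "Tamil Unicode character" is any code point in U+0B80..U+0BFF.
TAMIL_CHAR = re.compile(r"[\u0B80-\u0BFF]")
# The same consonant letter repeated, e.g. "kk" in "irukku".
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def word_features(word, term_counts):
    """Build a feature dict for one token."""
    w = word.lower()
    return {
        "has_tamil_unicode": bool(TAMIL_CHAR.search(word)),
        "in_english_dict": w in ENGLISH_WORDS,
        "in_tamil_dict": w in TAMIL_ROMAN_WORDS,
        "has_double_consonant": bool(DOUBLE_CONSONANT.search(w)),
        "term_frequency": term_counts[w],
    }


def extract_features(tokens):
    """Return one feature dict per token, in input order."""
    counts = Counter(t.lower() for t in tokens)
    return [word_features(t, counts) for t in tokens]
```

Feature dicts of this shape could then be vectorized and passed to any of the classifiers listed in the abstract, for example with scikit-learn's `DictVectorizer` and `LinearSVC`. How the original study encoded and combined its features is not stated in this record.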
Pages: 5