Code-Mixing in Social Media Text: The Last Language Identification Frontier?

Cited by: 0
Authors
Das, Amitava [1]
Gamback, Bjoern [2]
Affiliations
[1] NIIT Univ, Neemrana 301705, Rajasthan, India
[2] Norwegian Univ Sci & Technol, N-7491 Trondheim, Norway
Source
TRAITEMENT AUTOMATIQUE DES LANGUES | 2013, Vol. 54, No. 3
Keywords
Code-mixing; code-switching; social media text; language identification
DOI
Not available
Chinese Library Classification number
H0 [Linguistics]
Discipline classification code
030303; 0501; 050102
Abstract
Automatic understanding of noisy social media text is one of the prime present-day research areas. Most research has so far concentrated on English texts; however, more than half of the users are writing in other languages, making language identification a prerequisite for comprehensive processing of social media text. Though language identification has been considered an almost solved problem in other applications, language detectors fail in the social media context due to phenomena such as code-mixing, code-switching, lexical borrowings, Anglicisms, and phonetic typing. This paper reports an initial study to understand the characteristics of code-mixing in the social media context and presents a system developed to automatically detect language boundaries in code-mixed social media text, here exemplified by Facebook messages in mixed English-Bengali and English-Hindi.
Pages: 41-64
Page count: 24