Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora

Cited by: 10
Authors
Jamatia, Anupam [1]
Das, Amitava [2]
Gambaeck, Bjoern [3]
Affiliations
[1] Natl Inst Technol, Agartala, Tripura, India
[2] Indian Inst Informat Technol, Sricity, Andhra Pradesh, India
[3] Norwegian Univ Sci & Technol, Trondheim, Norway
Keywords
Language identification; code-mixing; deep learning
DOI
10.1515/jisys-2017-0440
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This article addresses word-level language identification in Indian social media corpora drawn from Facebook, Twitter and WhatsApp posts that exhibit code-mixing between English and Hindi, English and Bengali, and a blend of both language pairs. Code-mixing is a fusion of multiple languages that was previously associated mainly with spoken language, but which social media users also employ when communicating in rather casual ways. The coarse nature of code-mixed social media text makes language identification challenging. Here, the performance of deep learning on this task is compared with that of feature-based learning: two recurrent neural network techniques, Long Short-Term Memory (LSTM) and bidirectional LSTM, are contrasted with a Conditional Random Fields (CRF) classifier. The results show the deep learners outscoring the CRF, with the bidirectional LSTM achieving the best language identification performance.
Pages: 399-408
Number of pages: 10