Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora

Cited by: 10

Authors:

Jamatia, Anupam [1]

Das, Amitava [2]

Gambaeck, Bjoern [3]

Affiliations:

[1] Natl Inst Technol, Agartala, Tripura, India

[2] Indian Inst Informat Technol, Sricity, Andhra Pradesh, India

[3] Norwegian Univ Sci & Technol, Trondheim, Norway

Source:

JOURNAL OF INTELLIGENT SYSTEMS | 2019 / Vol. 28 / Issue 03

Keywords:

Language identification; code-mixing; deep learning;

DOI:

10.1515/jisys-2017-0440

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This article addresses word-level language identification in Indian social media corpora drawn from Facebook, Twitter and WhatsApp posts that exhibit code-mixing in English-Hindi, English-Bengali, and a blend of both language pairs. Code-mixing is a fusion of multiple languages that was previously associated mainly with spoken language, but that social media users also employ when communicating in ways that tend to be rather casual. The coarse nature of code-mixed social media text makes language identification challenging. Here, the performance of deep learning on this task is compared to feature-based learning, with two Recurrent Neural Network techniques, Long Short-Term Memory (LSTM) and bidirectional LSTM, being contrasted with a Conditional Random Fields (CRF) classifier. The results show the deep learners outscoring the CRF, with the bidirectional LSTM demonstrating the best language identification performance.

Pages: 399-408

Page count: 10
