Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings

Cited by: 2

Authors:

Rayala, Upendar Rao [1,2]

Seshadri, Karthick [1]

Sristy, Nagesh Bhattu [1]

Affiliations:

[1] Natl Inst Technol, Dept Comp Sci & Engn, Tadepalligudem, Andhra Pradesh, India

[2] Rajiv Gandhi Univ Knowledge Technol, Nuzividu 521202, Andhra Pradesh, India

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2023, Vol. 22, No. 10

Keywords:

Code-mixing; sentiment analysis; transliterated text; deep neural networks; syllable-aware embeddings; bidirectional networks; gated recurrent unit; long short-term memory; word embeddings;

DOI:

10.1145/3620670

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline Classification Codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Learning the inherent meaning of a word in Natural Language Processing (NLP) has motivated researchers to represent a word at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embedding (SAWE) can effectively handle agglutinative and fusion-based NLP tasks. However, attempts to assess SAWE on such extrinsic NLP tasks have been scarce, especially for low-resource languages in the context of code-mixing with English. This article proposes a model that learns SAWE to extract semantics from fine-grained subunits of a word, and assesses the representative ability of the embeddings through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advancements in communication technologies have led to the prolific usage of mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English and their respective native language (using the phonetic form of English) when expressing their opinions or sentiments. Code-mixing allows borrowing words from a foreign language, shorthand notations, elongation of vowels, and words used without regard to syntactic or grammatical rules, all of which make sentiment analysis of code-mixed data challenging. Deep neural architectures like Long Short-Term Memory and Gated Recurrent Unit networks have been shown to be effective in solving several NLP tasks, such as sequence labeling, named entity recognition, and machine translation. In this article, a framework to perform sentiment analysis on a code-mixed Telugu-English review corpus is implemented. Both word embedding and SAWE are input to a unified deep neural network that contains a two-level Bidirectional Long Short-Term Memory/Gated Recurrent Unit network with Softmax as the output layer. By leveraging the advantages of both word embedding and SAWE, the proposed model outperforms existing SOTA code-mixed sentiment analysis models on a Telugu-English code-mixed dataset of the International Institute of Information Technology-Hyderabad and on a dataset curated by the authors. Compared with the best-performing SOTA model, the proposed model achieves a 3% increase in F1-score and a 2% increase in accuracy on the former dataset, and a 7% increase in F1-score and a 5% increase in accuracy on the latter.
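
The idea of feeding both a word-level vector and a syllable-aware vector for each token can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the authors' method: the regex-based `split_syllables` heuristic, the hash-based `toy_vector` stand-in for learned embeddings, the averaging of syllable vectors, and the 4-dimensional size are all hypothetical. The abstract does not specify how syllables are segmented or composed, and the two-level bidirectional recurrent network and Softmax layer are omitted here.

```python
import hashlib
import re

DIM = 4  # toy embedding size, chosen only for illustration


def split_syllables(word):
    """Rough split of a romanized word into consonant+vowel units.

    Hypothetical heuristic: trailing consonants are attached to the
    last unit; a word without vowels is returned as a single unit.
    """
    word = word.lower()
    parts = re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]+$)?", word)
    return parts or [word]


def toy_vector(unit, dim=DIM):
    """Deterministic stand-in for a learned embedding lookup."""
    digest = hashlib.sha256(unit.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def syllable_aware_vector(word):
    """Average the syllable vectors of a word (one simple composition choice)."""
    vectors = [toy_vector("syl:" + s) for s in split_syllables(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def combined_features(sentence):
    """Concatenate a word-level vector and a syllable-aware vector per token."""
    return [
        toy_vector("word:" + token.lower()) + syllable_aware_vector(token)
        for token in sentence.split()
    ]


# A short romanized Telugu-English example sentence (illustrative input only).
features = combined_features("movie chala bagundi")
```

In a setup like the one the abstract describes, a sequence of such per-token vectors would be passed to the recurrent layers. One reason a syllable-level view may help with code-mixed text is that spelling variants such as elongated vowels still share most of their syllable units, whereas a word-level lookup would treat each variant as a separate token.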

Pages: 30
