Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings

Cited by: 2
Authors
Rayala, Upendar Rao [1, 2]
Seshadri, Karthick [1 ]
Sristy, Nagesh Bhattu [1 ]
Affiliations
[1] Natl Inst Technol, Dept Comp Sci & Engn, Tadepalligudem, Andhra Pradesh, India
[2] Rajiv Gandhi Univ Knowledge Technol, Nuzividu 521202, Andhra Pradesh, India
Keywords
Code-mixing; sentiment analysis; transliterated text; deep neural networks; syllable-aware embeddings; bidirectional networks; gated recurrent unit; long short-term memory; word embeddings
DOI
10.1145/3620670
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning the inherent meaning of a word in Natural Language Processing (NLP) has motivated researchers to represent words at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embedding (SAWE) can effectively handle agglutinative and fusion-based NLP tasks. However, attempts to assess SAWE on such extrinsic NLP tasks have been scanty, especially for low-resource languages in the context of code-mixing with English. This article proposes a model that learns SAWE to extract semantics from fine-grained subunits of a word, and assesses the representative ability of the embeddings through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advancements in communication technologies have led to the prolific use of code-mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English with their native language (written using the phonetic form of English) when expressing opinions or sentiments. Code-mixing allows writers to borrow words from a foreign language, use shorthand notations, elongate vowels, and use words without following syntactic or grammatical rules, all of which make sentiment analysis of code-mixed data challenging. Deep neural architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks have been shown to be effective on several NLP tasks, such as sequence labeling, named entity recognition, and machine translation. In this article, a framework to perform sentiment analysis on a code-mixed Telugu-English review corpus is implemented.
Both the word embedding and the SAWE are input to a unified deep neural network that contains a two-level Bidirectional LSTM/GRU network with Softmax as the output layer. By leveraging the advantages of both word embedding and SAWE, the proposed model outperforms existing SOTA code-mixed sentiment analysis models on a Telugu-English code-mixed dataset from the International Institute of Information Technology-Hyderabad and on a dataset curated by the authors. Compared with the best-performing SOTA model, the proposed model improves F1-score by 3% and accuracy by 2% on the first dataset, and F1-score by 7% and accuracy by 5% on the second.
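The idea of pairing a word-level vector with a syllable-level vector can be illustrated with a small sketch. This is not the authors' model: the regex-based syllable splitter, the hash-derived toy vectors, the averaging of syllable vectors, and the concatenation step are all assumptions made for illustration, and the names `syllabify`, `toy_vector`, and `combined_features` are hypothetical. In the article the embeddings are learned and fed to a two-level bidirectional recurrent network, which is not reproduced here.

```python
import hashlib
import re

VOWELS = "aeiou"
# Assumed rule for romanized text: a syllable is a run of non-vowels
# followed by a run of vowels. Real Telugu syllabification is richer.
_SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+")


def syllabify(word):
    """Split a romanized word into rough syllables (illustrative only)."""
    word = word.lower()
    parts = _SYLLABLE.findall(word)
    if not parts:
        return [word]
    # Attach any trailing consonants to the last syllable.
    tail = word[sum(len(p) for p in parts):]
    if tail:
        parts[-1] += tail
    return parts


def toy_vector(token, dim):
    """Deterministic stand-in for a learned embedding (dim <= 32)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(dim)]


def combined_features(word, dim=8):
    """Concatenate a word vector with the mean of its syllable vectors."""
    word_vec = toy_vector(word.lower(), dim)
    syl_vecs = [toy_vector(s, dim) for s in syllabify(word)]
    syl_mean = [sum(col) / len(syl_vecs) for col in zip(*syl_vecs)]
    return word_vec + syl_mean


print(syllabify("bagundi"))    # ['ba', 'gu', 'ndi']
print(syllabify("baguuundi"))  # ['ba', 'guuu', 'ndi']
print(len(combined_features("bagundi")))  # 16
```

In this toy setting the elongated spelling `baguuundi` is a different word-level token from `bagundi`, yet the two share the syllables `ba` and `ndi`, which hints at why a syllable-level view may help with the vowel elongation and spelling variation that the abstract describes.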
Pages: 30
Related papers
50 in total
  • [1] Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data
    Srirangam, Vamshi Krishna
    Reddy, Appidi Abhinav
    Singh, Vinay
    Shrivastava, Manish
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019): STUDENT RESEARCH WORKSHOP, 2019 : 183 - 189
  • [2] Word Embeddings for Code-Mixed Language Processing
    Pratapa, Adithya
    Choudhury, Monojit
    Sitaram, Sunayana
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018 : 3067 - 3072
  • [3] Sentiment Analysis of Persian-English Code-mixed Texts
    Sabri, Nazanin
    Edalat, Ali
    Bahrak, Behnam
    2021 26TH INTERNATIONAL COMPUTER CONFERENCE, COMPUTER SOCIETY OF IRAN (CSICC), 2021
  • [4] Sentiment analysis of code-mixed Dravidian languages leveraging pretrained model and word-level language tag
    Chanda, Supriya
    Mishra, Anshika
    Pal, Sukomal
    NATURAL LANGUAGE PROCESSING, 2024
  • [5] Zero-Shot Sentiment Analysis for Code-Mixed Data
    Yadav, Siddharth
    Chakraborty, Tanmoy
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 15941 - 15942
  • [6] Sentiment analysis leveraging emotions and word embeddings
    Giatsoglou, Maria
    Vozalis, Manolis G.
    Diamantaras, Konstantinos
    Vakali, Athena
    Sarigiannidis, George
    Chatzisavvas, Konstantinos Ch.
    EXPERT SYSTEMS WITH APPLICATIONS, 2017, 69 : 214 - 224
  • [7] Bilingual Sentiment Analysis for a Code-mixed Punjabi English Social Media Text
    Yadav, Konark
    Lamba, Aashish
    Gupta, Dhruv
    Gupta, Ansh
    Karmakar, Purnendu
    Saini, Sandeep
    PROCEEDINGS OF THE 2020 5TH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND SECURITY (ICCCS-2020), 2020
  • [8] An analysis of machine learning models for sentiment analysis of Tamil code-mixed data
    Shanmugavadivel, Kogilavani
    Sampath, Sai Haritha
    Nandhakumar, Pramod
    Mahalingam, Prasath
    Subramanian, Malliga
    Kumaresan, Prasanna Kumar
    Priyadharshini, Ruba
    COMPUTER SPEECH AND LANGUAGE, 2022, 76
  • [9] Transformer based multilingual joint learning framework for code-mixed and english sentiment analysis
    Mamta
    Ekbal, Asif
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2024, 62 (01) : 231 - 253
  • [10] Sentiment Analysis of Code-Mixed Text: A Comprehensive Review
    Perera, Anne
    Caldera, Amitha
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2024, 30 (02) : 242 - 261