Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP Tasks in Telugu Language

Cited by: 6
Authors
Marreddy, Mounika [1 ]
Oota, Subba Reddy [1 ]
Vakada, Lakshmi Sireesha [1 ]
Chinni, Venkata Charan [1 ]
Mamidi, Radhika [1 ]
Affiliations
[1] IIITH, Prof CR Rao Rd, Hyderabad, Telangana, India
Keywords
BERT-Te; RoBERTa-Te; ELMo-Te; resource creation; text classification; low-resource languages; CLASSIFICATION; EMOTIONS
DOI
10.1145/3531535
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Due to the lack of large annotated corpora, many resource-poor Indian languages struggle to reap the benefits of recent deep feature representations in Natural Language Processing (NLP). Moreover, adapting existing language models trained on large English corpora to Indian languages is often limited by data availability, rich morphological variation, and syntactic and semantic differences. In this paper, we explore representations ranging from traditional to recent efficient ones to overcome the challenges of a low-resource language, Telugu. In particular, our main objective is to mitigate the low-resource problem for Telugu. Overall, we present several contributions for a resource-poor language, viz., Telugu: (i) a large annotated dataset (35,142 sentences per task) for multiple NLP tasks, namely sentiment analysis, emotion identification, hate-speech detection, and sarcasm detection; (ii) lexicons for sentiment, emotion, and hate speech that improve the efficiency of the models; (iii) pretrained word and sentence embeddings; and (iv) several pretrained language models for Telugu, namely ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, and DistilBERT-Te, trained on a large Telugu corpus of 8,015,588 sentences (1,637,408 sentences from Telugu Wikipedia and 6,378,180 sentences crawled from various Telugu websites). Further, we show that these representations significantly improve performance on the four NLP tasks and present benchmark results for Telugu. We argue that our pretrained embeddings are competitive with or better than the existing multilingual pretrained models mBERT, XLM-R, and IndicBERT. Lastly, fine-tuning the pretrained models yields higher performance than linear probing on the four NLP tasks, with the following F1-scores: sentiment (68.72), emotion (58.04), hate speech (64.27), and sarcasm (77.93). We also experiment on publicly available Telugu datasets (Named Entity Recognition, Article Genre Classification, and Sentiment Analysis) and find that our Telugu pretrained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art systems on all but the sentiment task. We open-source our corpus, the four datasets, lexicons, embeddings, and code at https://github.com/Cha14ran/DREAM-T. The pretrained Transformer models for Telugu are available at https://huggingface.co/ltrctelugu.
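The pretrained Transformer models are published under the ltrctelugu organization on the Hugging Face Hub, so they can be loaded with the transformers library. Below is a minimal sketch of setting one of these Telugu checkpoints up for a downstream classification task, either with the encoder frozen (the linear-probing baseline) or fully trainable (the fine-tuning setting the abstract reports as stronger). The checkpoint identifier, label count, and example sentences are illustrative assumptions rather than the paper's exact setup; consult https://huggingface.co/ltrctelugu for the actual model names.

    # Minimal sketch: load a Telugu BERT checkpoint and prepare it for
    # sentiment classification. "ltrctelugu/bert_ltrc_telugu" is an assumed
    # checkpoint name, not confirmed by the paper.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_ID = "ltrctelugu/bert_ltrc_telugu"  # hypothetical identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=2  # e.g. positive / negative sentiment
    )

    # Linear-probing variant: freeze the encoder so only the classification
    # head is trained; remove this loop to fine-tune all parameters.
    for param in model.base_model.parameters():
        param.requires_grad = False

    # Tokenize a toy batch of Telugu sentences (placeholders) and run a
    # forward pass to confirm the shapes line up.
    batch = tokenizer(
        ["ఈ సినిమా చాలా బాగుంది", "ఈ సినిమా నచ్చలేదు"],
        padding=True, truncation=True, return_tensors="pt",
    )
    outputs = model(**batch)
    print(outputs.logits.shape)  # (batch_size, num_labels)

Training either variant on the released datasets can then proceed with any standard loop (for example transformers.Trainer); the frozen-encoder version corresponds to the linear-probing results, while training all parameters corresponds to the fine-tuned F1-scores quoted above.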
Pages: 34