HinPLMs: Pre-trained Language Models for Hindi

Cited by: 1
Authors
Huang, Xixuan [1]
Lin, Nankai [1]
Li, Kexin [1]
Wang, Lianxi [1,2]
Gan, Suifu [3]
Affiliations
[1] Guangdong Univ Foreign Studies, Sch Informat Sci & Technol, Guangzhou, Peoples R China
[2] Guangdong Univ Foreign Studies, Guangzhou Key Lab Multilingual Intelligent Proc, Guangzhou, Peoples R China
[3] Jinan Univ, Sch Management, Guangzhou, Peoples R China
Keywords
Hindi Language Processing; Pre-trained Models; Corpus Construction; Romanization;
DOI
10.1109/IALP54817.2021.9675194
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Pre-trained models (PTMs) have been shown to significantly improve the performance of natural language processing (NLP) tasks for resource-rich languages, and to reduce the amount of labeled data required for supervised learning. However, research and shared-task datasets for Hindi remain scarce, and PTMs for the Romanized Hindi script have rarely been released. In this work, we construct a Hindi pre-training corpus in both the Devanagari and Romanized scripts, and train two versions of Hindi pre-trained models: Hindi-Devanagari-Roberta and Hindi-Romanized-Roberta. We evaluate our models on five types of downstream NLP tasks across eight datasets, and compare them with existing Hindi pre-trained models and commonly used methods. Experimental results show that the proposed models achieve the best results on all tasks, especially Part-of-Speech Tagging and Named Entity Recognition, which demonstrates the validity and superiority of our Hindi pre-trained models. Specifically, the Devanagari Hindi pre-trained model outperforms the Romanized Hindi pre-trained model on single-label Text Classification, Part-of-Speech Tagging, Named Entity Recognition, and Natural Language Inference, whereas the Romanized Hindi pre-trained model performs better on multi-label Text Classification and Machine Reading Comprehension, which may indicate that the Romanized-script model has advantages on such tasks. We will release our models to the community to promote the future development of Hindi NLP.
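As a rough illustration of the two-script setup described in the abstract, the sketch below romanizes a Devanagari sentence with the open-source indic-transliteration package and instantiates a RoBERTa masked-LM configuration with the Hugging Face transformers library. This is a minimal sketch under stated assumptions, not the authors' pipeline: the IAST transliteration scheme, the example sentence, and the model hyperparameters are illustrative choices only, since the record does not specify the paper's actual romanization scheme, tokenizer, or training configuration.

# Minimal sketch (assumptions noted): produce a Latin-script (Romanized)
# counterpart of a Devanagari line and set up a RoBERTa-style masked-LM.
# The transliteration scheme (IAST) and all hyperparameters are assumed,
# not taken from the paper.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
from transformers import RobertaConfig, RobertaForMaskedLM

def romanize_line(devanagari_text: str) -> str:
    """Convert a Devanagari line to a Latin-script (romanized) form."""
    return transliterate(devanagari_text, sanscript.DEVANAGARI, sanscript.IAST)

# Example Hindi sentence in both scripts (illustrative only).
dev_sentence = "यह एक उदाहरण वाक्य है"
rom_sentence = romanize_line(dev_sentence)
print(rom_sentence)  # e.g. "yaha eka udāharaṇa vākya hai"

# Two RoBERTa masked-LM models would then be pre-trained, one per script,
# each with its own tokenizer trained on the corresponding corpus.
config = RobertaConfig(
    vocab_size=52_000,            # assumed tokenizer vocabulary size
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(f"parameters: {model.num_parameters():,}")

In practice each script's corpus would feed its own tokenizer and masked-LM training run, yielding the two model variants the abstract compares; the snippet only shows how the Romanized corpus could be derived and how a base-sized RoBERTa model would be instantiated.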
Pages: 241-246
Number of pages: 6