HinPLMs: Pre-trained Language Models for Hindi

Cited by: 1
Authors
Huang, Xixuan [1]
Lin, Nankai [1]
Li, Kexin [1]
Wang, Lianxi [1,2]
Gan, Suifu [3]
Affiliations
[1] Guangdong Univ Foreign Studies, Sch Informat Sci & Technol, Guangzhou, Peoples R China
[2] Guangdong Univ Foreign Studies, Guangzhou Key Lab Multilingual Intelligent Proc, Guangzhou, Peoples R China
[3] Jinan Univ, Sch Management, Guangzhou, Peoples R China
Keywords
Hindi Language Processing; Pre-trained Models; Corpus Construction; Romanization;
DOI
10.1109/IALP54817.2021.9675194
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
It has been shown that pre-trained models (PTMs) can significantly improve the performance of natural language processing (NLP) tasks for resource-rich languages and also reduce the amount of labeled data required for supervised learning. However, there is still little research and few shared task datasets available for Hindi, and PTMs for the Romanized Hindi script have rarely been released. In this work, we construct a Hindi pre-training corpus in the Devanagari and Romanized scripts and train two versions of Hindi pre-trained models: Hindi-Devanagari-Roberta and Hindi-Romanized-Roberta. We evaluate our models on 5 types of downstream NLP tasks with 8 datasets and compare them with existing Hindi pre-trained models and commonly used methods. Experimental results show that the models proposed in this work achieve the best results on all tasks, especially on Part-of-Speech Tagging and Named Entity Recognition, which demonstrates the validity and superiority of our Hindi pre-trained models. Specifically, the Devanagari Hindi pre-trained model outperforms the Romanized Hindi pre-trained model on single-label Text Classification, Part-of-Speech Tagging, Named Entity Recognition, and Natural Language Inference, whereas the Romanized Hindi pre-trained model performs better on multi-label Text Classification and Machine Reading Comprehension, which may indicate that a pre-trained model for the Romanized Hindi script has advantages in such tasks. We will publish our models to the community with the intention of promoting the future development of Hindi NLP.
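The abstract describes a two-script pipeline: build a Hindi corpus in both Devanagari and Romanized scripts, then pre-train a RoBERTa-style masked language model on each. The sketch below illustrates that pipeline with commonly used open-source tools (indic_transliteration for romanization, Hugging Face transformers/datasets for MLM pre-training). The transliteration scheme (ITRANS), file names, tokenizer path, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the two-script pipeline from the abstract:
# romanize a Devanagari corpus, then pre-train a RoBERTa-style masked LM.
# Scheme, paths, and hyperparameters are assumptions, not the paper's setup.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
from datasets import load_dataset
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1) Corpus construction: derive a Romanized version of the Devanagari corpus.
def romanize(line: str) -> str:
    """Transliterate Devanagari Hindi into a Roman-script form (ITRANS here)."""
    return transliterate(line, sanscript.DEVANAGARI, sanscript.ITRANS)

with open("hindi_devanagari.txt", encoding="utf-8") as src, \
     open("hindi_romanized.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(romanize(line))

# 2) Pre-training: a RoBERTa masked LM on one of the two corpora.
#    A script-specific BPE tokenizer is assumed to have been trained and
#    saved to ./hindi_tokenizer beforehand.
tokenizer = RobertaTokenizerFast.from_pretrained("./hindi_tokenizer")

dataset = load_dataset("text", data_files={"train": "hindi_devanagari.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="hindi-devanagari-roberta",
    per_device_train_batch_size=16,
    num_train_epochs=1,
)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```

The Romanized model would be trained the same way on hindi_romanized.txt; fine-tuning on the downstream tasks (POS tagging, NER, NLI, etc.) would then start from the resulting checkpoint with the corresponding task head.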
Pages
241-246 (6 pages)