HinPLMs: Pre-trained Language Models for Hindi

Cited by: 1
Authors
Huang, Xixuan [1]
Lin, Nankai [1]
Li, Kexin [1]
Wang, Lianxi [1,2]
Gan, Suifu [3]
Affiliations
[1] Guangdong Univ Foreign Studies, Sch Informat Sci & Technol, Guangzhou, Peoples R China
[2] Guangdong Univ Foreign Studies, Guangzhou Key Lab Multilingual Intelligent Proc, Guangzhou, Peoples R China
[3] Jinan Univ, Sch Management, Guangzhou, Peoples R China
Keywords
Hindi Language Processing; Pre-trained Models; Corpus Construction; Romanization;
DOI
10.1109/IALP54817.2021.9675194
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
It has been shown that pre-trained models (PTMs) can significantly improve the performance of natural language processing (NLP) tasks for resource-rich languages and reduce the amount of labeled data required for supervised learning. However, there is still little research and few shared-task datasets available for Hindi, and PTMs for the Romanized Hindi script have rarely been released. In this work, we construct a Hindi pre-training corpus in both the Devanagari and Romanized scripts, and train two versions of Hindi pre-trained models: Hindi-Devanagari-Roberta and Hindi-Romanized-Roberta. We evaluate our models on 5 types of downstream NLP tasks with 8 datasets and compare them with existing Hindi pre-trained models and commonly used methods. Experimental results show that the models proposed in this work achieve the best results on all tasks, especially Part-of-Speech Tagging and Named Entity Recognition, which demonstrates the validity and superiority of our Hindi pre-trained models. Specifically, the Devanagari Hindi pre-trained model outperforms the Romanized Hindi pre-trained model on single-label Text Classification, Part-of-Speech Tagging, Named Entity Recognition, and Natural Language Inference, whereas the Romanized Hindi pre-trained model performs better on multi-label Text Classification and Machine Reading Comprehension, which may indicate that a pre-trained model for the Romanized Hindi script has advantages on such tasks. We will publish our models to the community with the intention of promoting the future development of Hindi NLP.
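As a brief illustration of how RoBERTa-style masked-language-model PTMs like the ones described in the abstract are typically consumed, the following minimal Python sketch loads such a model and predicts a masked token. The library choice (Hugging Face Transformers) and the checkpoint identifier "hindi-devanagari-roberta" are assumptions for illustration only, not the authors' released code or model name.

# Minimal usage sketch, NOT the authors' released code. The checkpoint name
# "hindi-devanagari-roberta" is a hypothetical placeholder, and the use of the
# Hugging Face Transformers library is an assumption.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "hindi-devanagari-roberta"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# "Delhi is the <mask> of India." written in Devanagari Hindi.
text = f"दिल्ली भारत की {tokenizer.mask_token} है।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring prediction.
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))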
Pages: 241-246
Page count: 6