Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Cited by: 1
Authors
Mi, Chenggang [1]
Zhu, Shaolin [2]
Nie, Rui [3]
Affiliations
[1] Northwestern Polytechnical University, School of Computer Science, Xi'an, People's Republic of China
[2] Zhengzhou University of Light Industry, College of Software Engineering, Zhengzhou, People's Republic of China
[3] Chinese Flight Test Establishment, Xi'an, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Computational linguistics - Natural language processing systems;
DOI
10.1155/2021/9975078
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic have usually focused on high-resource languages (such as Chinese, English, and Russian); for low-resource languages such as Uyghur and Mongolian, loanword identification tends to perform worse because of limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource loanword identification. We then introduce a loanword identification model based on a log-linear RNN, which improves performance by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) tags into one model. Experimental results on loanword identification in Uyghur (this study mainly focuses on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that the proposed method achieves the best performance compared with several strong baseline systems.
Pages: 9