Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

被引:1
|
作者
Mi, Chenggang [1 ]
Zhu, Shaolin [2 ]
Nie, Rui [3 ]
机构
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China
[2] Zhengzhou Univ Light Ind, Coll Software Engn, Zhengzhou, Peoples R China
[3] Chinese Flight Test Estab, Xian, Peoples R China
基金
中国国家自然科学基金;
关键词
Computational linguistics - Natural language processing systems;
D O I
10.1155/2021/9975078
中图分类号
Q [生物科学];
学科分类号
07 ; 0710 ; 09 ;
摘要
Loanword identification is studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation, cross-lingual information retrieval, and so on. However, recent studies on this topic usually put efforts on high-resource languages (such as Chinese, English, and Russian); for low-resource languages, such as Uyghur and Mongolian, due to the limitation of resources and lack of annotated data, loanword identification on these languages tends to have lower performance. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource language loanword identification; then, a loanword identification model based on a log-linear RNN is introduced to improve the performance of low-resource loanword identification by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) showed that our proposed method achieves best performance compared with several strong baseline systems.
引用
收藏
页数:9
相关论文
共 50 条
  • [1] SYNTHETIC DATA AUGMENTATION FOR IMPROVING LOW-RESOURCE ASR
    Thai, Bao
    Jimerson, Robert
    Arcoraci, Dominic
    Prud'hommeaux, Emily
    Ptucha, Raymond
    2019 IEEE WESTERN NEW YORK IMAGE AND SIGNAL PROCESSING WORKSHOP (WNYISPW), 2019,
  • [2] Loanword Identification in Low-Resource Languages with Minimal Supervision
    Mi, Chenggang
    Xie, Lei
    Zhang, Yanning
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (03)
  • [3] Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation
    Sorokin, Alexey
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3978 - 3983
  • [4] A Study on Low-resource Language Identification
    Qi, Zhaodi
    Ma, Yong
    Gu, Mingliang
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1897 - 1902
  • [5] Combining Simple but Novel Data Augmentation Methods for Improving Low-Resource ASR
    Damania, Ronit
    Homan, Christopher
    Prud'hommeaux, Emily
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022, 2022-September : 4890 - 4894
  • [6] Combining Simple but Novel Data Augmentation Methods for Improving Low-Resource ASR
    Damania, Ronit
    Homan, Christopher
    Prud'hommeaux, Emily
    INTERSPEECH 2022, 2022, : 4890 - 4894
  • [7] Improving Low-resource Named Entity Recognition with Graph Propagated Data Augmentation
    Cai, Jiong
    Huang, Shen
    Jiang, Yong
    Tan, Zeqi
    Xie, Pengjun
    Tu, Kewei
    61ST CONFERENCE OF THE THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 110 - 118
  • [8] MELM: Data Augmentation with Masked Entity Language Modeling for Low-Resource NER
    Zhou, Ran
    Li, Xin
    He, Ruidan
    Bing, Lidong
    Cambria, Erik
    Si, Luo
    Miao, Chunyan
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 2251 - 2262
  • [9] Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language
    Jiang, Lanlan
    Qin, Xingguo
    Zhang, Jingwei
    Li, Jun
    APPLIED SCIENCES-BASEL, 2024, 14 (20):
  • [10] Generalized Data Augmentation for Low-Resource Translation
    Xia, Mengzhou
    Kong, Xiang
    Anastasopoulos, Antonios
    Neubig, Graham
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 5786 - 5796