Improving Loanword Identification in Low-Resource Language with Data Augmentation and Multiple Feature Fusion

Cited by: 1
Authors
Mi, Chenggang [1]
Zhu, Shaolin [2]
Nie, Rui [3]
Affiliations
[1] Northwestern Polytechnical University, School of Computer Science, Xi'an, People's Republic of China
[2] Zhengzhou University of Light Industry, College of Software Engineering, Zhengzhou, People's Republic of China
[3] Chinese Flight Test Establishment, Xi'an, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Computational linguistics - Natural language processing systems;
DOI
10.1155/2021/9975078
Chinese Library Classification
Q [Biological Sciences];
Subject classification codes
07; 0710; 09;
Abstract
Loanword identification has been studied in recent years as a way to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic have focused mainly on high-resource languages (such as Chinese, English, and Russian). For low-resource languages such as Uyghur and Mongolian, limited resources and a lack of annotated data tend to result in lower loanword identification performance. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource loanword identification. We then introduce a loanword identification model based on a log-linear RNN, which improves performance by incorporating word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) features into a single model. Experimental results on loanword identification in Uyghur (this study focuses mainly on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that the proposed method outperforms several strong baseline systems.
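As a rough illustration of what combining several features in a log-linear fashion can look like, the sketch below scores a candidate word with a weighted sum of two hand-made features and normalizes the scores with a softmax. It is not the authors' model: the record gives no implementation details, so the function names, the use of normalized edit distance as a stand-in for pronunciation similarity, the feature names, the weights, and the romanized example strings are all assumptions made for illustration only.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pronunciation_similarity(a: str, b: str) -> float:
    """Hypothetical proxy: 1 minus normalized edit distance, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def log_linear_probs(features: dict, weights: dict) -> dict:
    """Softmax over per-label weighted feature sums (a log-linear model)."""
    scores = {
        label: sum(weights[name] * value for name, value in feats.items())
        for label, feats in features.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical romanized candidate and donor-language forms.
sim = pronunciation_similarity("telefon", "telefon")
weights = {"pron_sim": 2.0, "pos_is_noun": 0.5}  # made-up weights
features = {
    "loanword": {"pron_sim": sim, "pos_is_noun": 1.0},
    "native": {"pron_sim": 1.0 - sim, "pos_is_noun": 1.0},
}
probs = log_linear_probs(features, weights)
print(probs)
```

In a trained system the weights would be learned from annotated data and the embedding features would come from a neural encoder. The fixed numbers here only show how heterogeneous feature scores can be fused into one probability.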
Pages: 9
Related papers
50 in total
  • [21] Improving Data Augmentation for Low-Resource NMT Guided by POS-Tagging and Paraphrase Embedding
    Maimaiti, Mieradilijiang
    Liu, Yang
    Luan, Huanbo
    Pan, Zegao
    Sun, Maosong
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, 20 (06)
  • [22] Multilingual Offensive Language Identification for Low-resource Languages
    Ranasinghe, Tharindu
    Zampieri, Marcos
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (01)
  • [23] MIXSPEECH: DATA AUGMENTATION FOR LOW-RESOURCE AUTOMATIC SPEECH RECOGNITION
    Meng, Linghui
    Xu, Jin
    Tan, Xu
    Wang, Jindong
    Qin, Tao
    Xu, Bo
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7008 - 7012
  • [24] Data augmentation for low-resource grapheme-to-phoneme mapping
    Hammond, Michael
    SIGMORPHON 2021: 18TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY, 2021, : 126 - 130
  • [25] Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution
    Nguyen, Toan Q.
    Murray, Kenton
    Chiang, David
    IWSLT 2021: THE 18TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION, 2021, : 287 - 293
  • [26] DALE: Generative Data Augmentation for Low-Resource Legal NLP
    Ghosh, Sreyan
    Evuru, Chandra Kiran
    Kumar, Sonal
    Ramaneswaran, S.
    Sakshi, S.
    Tyagi, Utkarsh
    Manocha, Dinesh
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 8511 - 8565
  • [27] Language fusion via adapters for low-resource speech recognition
    Hu, Qing
    Zhang, Yan
    Zhang, Xianlei
    Han, Zongyu
    Liang, Xiuxia
    SPEECH COMMUNICATION, 2024, 158
  • [28] IMPROVING DATA SELECTION FOR LOW-RESOURCE STT AND KWS
    Fraga-Silva, Thiago
    Laurent, Antoine
    Gauvain, Jean-Luc
    Lamel, Lori
    Le, Viet-Bac
    Messaoudi, Abdel
    2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015, : 153 - 159
  • [29] IMPROVING HUMAN-COMPUTER INTERACTION IN LOW-RESOURCE SETTINGS WITH TEXT-TO-PHONETIC DATA AUGMENTATION
    Stiff, Adam
    Serai, Prashant
    Fosler-Lussier, Eric
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7320 - 7324
  • [30] Data Selection using Spoken Language Identification for Low-Resource and Zero-Resource Speech Recognition
    Chen, Jianan
    Chu, Chenhui
    Li, Sheng
    Kawahara, Tatsuya
    APSIPA ASC 2024 - Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2024, 2024,