Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language

Cited by: 35
Authors
Das, Arjun [1]
Ganguly, Debasis [2]
Garain, Utpal [3]
Affiliations
[1] Univ Calcutta, Dept Comp Sci & Engn, JD 2,Sect 3, Kolkata 700106, India
[2] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland
[3] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, 203 BT Rd, Kolkata 700108, India
Funding
Science Foundation Ireland
Keywords
Design; Algorithms; Performance; Word embedding; CRF-based NER; Wikipedia-based NER; unsupervised NER; language-independent NER; classifier;
DOI
10.1145/3015467
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this article, we propose a word embedding-based approach to named entity recognition (NER). NER is commonly treated as a sequence labeling task and addressed with methods such as conditional random fields (CRFs). However, for low-resource languages that lack sufficiently large training data, methods such as CRFs do not perform well. In our work, we use the proximity of the vector embeddings of words to approach the NER problem. The hypothesis is that the vectors of words belonging to the same name category, such as person names, lie close together in the embedding space. Assuming that this clustering hypothesis holds, we apply a standard classification approach to the word vectors to learn a decision boundary between the NER classes. Our NER experiments are conducted on Bengali, a morphologically rich and low-resource language. Our approach significantly outperforms standard baseline CRF approaches that use cluster labels of word embeddings and gazetteers constructed from Wikipedia. Further, we propose an unsupervised approach that, in the absence of training data, uses a named entity (NE) gazetteer created automatically from Wikipedia. For a low-resource language, the word vectors obtained from Wikipedia are not sufficient to train a classifier. We therefore use the distance between the vector embeddings of words to expand the set of Wikipedia training examples with additional NEs extracted from a monolingual corpus, which yields a significant improvement in unsupervised NER performance. In fact, our expansion method performs better than the traditional supervised CRF-based approach (F-score of 65.4% vs. 64.2%). Finally, we compare our proposed approach to the official submissions for the IJCNLP-2008 Bengali NER shared task and achieve an overall F-score improvement of 11.26% over the best official system.
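The two ideas in the abstract, expanding a small seed gazetteer through embedding proximity and then classifying word vectors, can be illustrated with a minimal sketch. This is not the authors' implementation. The cosine-similarity threshold, the nearest-centroid classifier (standing in for the unspecified "standard classification approach"), the two-dimensional toy vectors, and all function names are assumptions made for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_seeds(seeds, embeddings, threshold=0.95):
    """Grow a seed gazetteer (word -> label) with unlabeled words whose
    most similar seed has cosine similarity >= threshold (assumed rule)."""
    expanded = dict(seeds)
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        best_label, best_sim = None, -1.0
        for seed, label in seeds.items():
            sim = cosine(vec, embeddings[seed])
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_sim >= threshold:
            expanded[word] = best_label
    return expanded

def train_centroids(labeled, embeddings):
    """Nearest-centroid stand-in for a classifier: one mean vector per class."""
    by_label = {}
    for word, label in labeled.items():
        by_label.setdefault(label, []).append(embeddings[word])
    return {label: np.mean(vecs, axis=0) for label, vecs in by_label.items()}

def classify(vec, centroids):
    """Assign the class whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical toy embeddings; real ones would be trained on a monolingual corpus.
embeddings = {
    "kolkata": np.array([0.90, 0.10]),
    "dhaka":   np.array([0.85, 0.20]),
    "tagore":  np.array([0.10, 0.90]),
    "nazrul":  np.array([0.20, 0.95]),
    "river":   np.array([0.60, 0.60]),
}
# Hypothetical seed gazetteer, standing in for entries mined from Wikipedia.
seeds = {"kolkata": "LOC", "tagore": "PER"}

expanded = expand_seeds(seeds, embeddings)
centroids = train_centroids(expanded, embeddings)
predicted = classify(np.array([0.80, 0.30]), centroids)
```

In this toy setup "dhaka" and "nazrul" join the gazetteer because they lie close to a seed, while "river" stays out because it is not close enough to either one. A realistic setup would also need a non-entity class and a threshold tuned on held-out data.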
Pages: 19