Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language

Cited by: 35

Authors:

Das, Arjun [1]

Ganguly, Debasis [2]

Garain, Utpal [3]

Affiliations:

[1] Univ Calcutta, Dept Comp Sci & Engn, JD 2,Sect 3, Kolkata 700106, India

[2] Dublin City Univ, ADAPT Ctr, Sch Comp, Dublin, Ireland

[3] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, 203 BT Rd, Kolkata 700108, India

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2017, Vol. 16, No. 3

Funding:

Science Foundation Ireland

Keywords:

Design; Algorithms; Performance; Word embedding; CRF-based NER; Wikipedia-based NER; unsupervised NER; language-independent NER; classifier;

DOI:

10.1145/3015467

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

In this article, we propose a word embedding-based named entity recognition (NER) approach. NER is commonly approached as a sequence labeling task using methods such as the conditional random field (CRF). However, for low-resource languages that lack sufficiently large training data, methods such as CRF do not perform well. In our work, we make use of the proximity of the vector embeddings of words to approach the NER problem. The hypothesis is that word vectors belonging to the same name category, such as a person's name, occur in close vicinity in the abstract vector space of the embedded words. Assuming that this clustering hypothesis is true, we apply a standard classification approach on the vectors of words to learn a decision boundary between the NER classes. Our NER experiments are conducted on a morphologically rich and low-resource language, namely Bengali. Our approach significantly outperforms standard baseline CRF approaches that use cluster labels of word embeddings and gazetteers constructed from Wikipedia. Further, we propose an unsupervised approach that, in the absence of training data, uses a named entity (NE) gazetteer created automatically from Wikipedia. For a low-resource language, the word vectors obtained from Wikipedia are not sufficient to train a classifier. We therefore propose to use the distance between the vector embeddings of words to expand the set of Wikipedia training examples with additional NEs extracted from a monolingual corpus, which yields a significant improvement in unsupervised NER performance. In fact, our expansion method performs better than the traditional CRF-based (supervised) approach (i.e., F-score of 65.4% vs. 64.2%). Finally, we compare our proposed approach to the official submissions for the IJCNLP-2008 Bengali NER shared task and achieve an overall F-score improvement of 11.26% with respect to the best official system.
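
The two ideas in the abstract (growing a small seed gazetteer using embedding distance, then classifying word vectors) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the paper: the 2-D toy vectors, the labels `LOC` and `PER`, the similarity threshold, the function names, and the use of a nearest-centroid rule in place of whatever classifier the authors actually trained.

```python
import math

# Hypothetical 2-D "embeddings"; real word vectors would have hundreds of dimensions.
EMBEDDINGS = {
    "kolkata": [0.90, 0.10],
    "dhaka": [0.80, 0.20],
    "delhi": [0.85, 0.15],
    "tagore": [0.10, 0.90],
    "nazrul": [0.20, 0.80],
    "satyajit": [0.15, 0.85],
    "river": [0.70, 0.70],  # not close to either seed
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_seeds(seeds, embeddings, threshold):
    """Add unlabeled words whose most similar seed has similarity >= threshold.

    Each added word receives the label of its most similar seed.
    """
    expanded = dict(seeds)
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        best_label, best_sim = None, threshold
        for seed, label in seeds.items():
            sim = cosine(vec, embeddings[seed])
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label is not None:
            expanded[word] = best_label
    return expanded


def train_centroids(labeled, embeddings):
    """Average the vectors of each class (a stand-in for a trained classifier)."""
    by_label = {}
    for word, label in labeled.items():
        by_label.setdefault(label, []).append(embeddings[word])
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_label.items()
    }


def classify(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Two seed entries play the role of a tiny Wikipedia-derived gazetteer.
seeds = {"kolkata": "LOC", "tagore": "PER"}
expanded = expand_seeds(seeds, EMBEDDINGS, threshold=0.98)
centroids = train_centroids(expanded, EMBEDDINGS)
print(sorted(expanded.items()))
print(classify([0.95, 0.05], centroids))
```

In this toy setup the threshold keeps `river` out of the expanded set because it is not close enough to either seed. How the threshold and the classifier are chosen in practice is outside what the abstract states.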

Pages: 19
