Development and evaluation of a NER model in the domain of cultural analysis and tourism

Authors
Docio, Susana Sotelo [1]
Gamallo, Pablo [1]
Iriarte, Alvaro [2]
Affiliations
[1] Univ Santiago de Compostela, Santiago, Spain
[2] Univ Minho, Braga, Portugal
Source
LINGUAMATICA | 2023 / Vol. 15 / No. 02
Keywords
named-entity recognition; machine learning; neural networks; transformers; evaluation; corpus
DOI
10.21814/lm.15.2.405
Chinese Library Classification (CLC) number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
Named Entity Recognition (NER) is an essential task in information extraction where entities in a text are identified and classified. One of the primary challenges addressed by NER systems is the difficulty of generalizing what was learned to different types of corpora beyond the training data. This problem is magnified by the fact that most of the training corpora used are journalistic and therefore need to be adapted to other genres and domains. In this paper, we use a Spanish corpus consisting of interviews with visitors to the city of Santiago de Compostela and annotated with named entities, to evaluate and train NER systems tailored to the domain of cultural analysis and tourism. We provide a comprehensive comparison of various approaches employed, ranging from classical machine learning algorithms to fine-tuning Transformer models. The results significantly outperform the baseline, represented here by the toolkits Stanza, spaCy and FLAIR, although initial tests with unseen entities during training highlight the need for additional evaluations regarding their generalization capability and the utilization of adversarial splits for the corpus.
Pages: 3-18
Page count: 16