An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition

Cited by: 2
Authors
Hoxha, Klesti [1]
Baxhaku, Artur [1]
Affiliations
[1] Univ Tirana, Fac Nat Sci, Tirana 1001, Albania
Keywords
Named entity recognition; natural language processing; language corpora; semi-automatic annotation; information extraction
DOI
10.2478/cait-2018-0009
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Named Entity Recognition (NER) is an important task in many NLP pipelines. It has become especially important for the knowledge bases that power many of today's information retrieval systems. To cope with the high demand for annotated training corpora for supervised NER systems, automatic generation approaches have been proposed. In this paper, we report on the first automatically generated NE-annotated corpus for Albanian. News articles from Albanian news media were used as the document source. They were automatically tagged using a custom gazetteer generated from the Albanian Wikipedia. Our evaluation results show that this corpus can be used as a baseline corpus for human-annotated ones or as a training corpus where no other is available.
Pages: 95-108
Number of pages: 14
Related papers
  • [1] Assessment of disease named entity recognition on a corpus of annotated sentences
    Jimeno, Antonio
    Jimenez-Ruiz, Ernesto
    Lee, Vivian
    Gaudan, Sylvain
    Berlanga, Rafael
    Rebholz-Schuhmann, Dietrich
    BMC BIOINFORMATICS, 2008, 9 (Suppl 3)
  • [2] Assessment of disease named entity recognition on a corpus of annotated sentences
    Antonio Jimeno
    Ernesto Jimenez-Ruiz
    Vivian Lee
    Sylvain Gaudan
    Rafael Berlanga
    Dietrich Rebholz-Schuhmann
    BMC Bioinformatics, 9
  • [3] A Named Entity Recognition Approach for Albanian
    Skenduli, Marjana Prifti
    Biba, Marenglen
    2013 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2013, : 1532 - 1537
  • [4] Named Entity Recognition for Partially Annotated Datasets
    Strobl, Michael
    Trabelsi, Amine
    Zaiane, Osmar
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2022), 2022, 13286 : 299 - 306
  • [5] Development of a Hindi Named Entity Recognition System without Using Manually Annotated Training Corpus
    Saha, Sujan Kumar
    Majumder, Mukta
    INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2018, 15 (06) : 1088 - 1098
  • [6] A Twitter Corpus for Named Entity Recognition in Turkish
    Carik, Buse
    Yeniterzi, Reyyan
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 4546 - 4551
  • [7] Thai Nested Named Entity Recognition Corpus
    Buaphet, Weerayut
    Udomcharoenchaikit, Can
    Limkonchotiwat, Peerat
    Rutherford, Attapol T.
    Nutanong, Sarana
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 1473 - 1486
  • [8] A Finnish news corpus for named entity recognition
    Teemu Ruokolainen
    Pekka Kauppinen
    Miikka Silfverberg
    Krister Lindén
    Language Resources and Evaluation, 2020, 54 : 247 - 272
  • [9] A Finnish news corpus for named entity recognition
    Ruokolainen, Teemu
    Kauppinen, Pekka
    Silfverberg, Miikka
    Linden, Krister
    LANGUAGE RESOURCES AND EVALUATION, 2020, 54 (01) : 247 - 272
  • [10] Building a Named Entity Annotated Bilingual English-Vietnamese Corpus
    Tuan-An Dao
    Hung-Thinh Truong
    Long Nguyen
    Dien Dinh
    PROCEEDINGS OF 2018 10TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (KSE), 2018, : 61 - 66