AsNER - Annotated Dataset and Baseline for Assamese Named Entity Recognition

Cited by: 0
Authors
Pathak, Dhrubajyoti [1]
Nandi, Sukumar [1]
Sarmah, Priyankoo [1]
Affiliations
[1] Indian Institute of Technology Guwahati, North Guwahati, India
Keywords
NER dataset; Language Resources; Assamese NER; Assamese Language; Named Entity Recognition; NER model; AsNER
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens of text drawn from the speech of the Prime Minister of India and from Assamese drama. It also contains person names, location names and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network-based Assamese language processing. We benchmark the dataset by training and evaluating NER models with state-of-the-art approaches for supervised named entity recognition (NER) such as FastText, BERT, XLM-R, FLAIR and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.
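The abstract reports its best result as an F1-score. As a rough illustration of how such a score is commonly computed for NER, the sketch below derives an entity-level micro F1 from BIO-tagged sequences. It is a minimal sketch under assumptions, not the authors' evaluation code: the record does not state the tagging scheme, the label names, or whether the reported F1 is entity-level or token-level, so the BIO scheme, the `PER` and `LOC` labels, and the function names `bio_to_spans` and `entity_f1` are all hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); labels are illustrative only
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    A stray I- tag with no preceding B- of the same type is treated
    leniently as the start of a new span (an assumption of this sketch).
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((label, start, i))  # close the span that was open
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]     # open a new span
            else:
                start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1: a span counts only on exact match."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    # One of two gold entities is found, with no false positives.
    print(round(entity_f1(gold, pred), 4))
```

Exact-match span scoring is stricter than token-level accuracy, which is one reason NER results are usually quoted as F1 rather than accuracy.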
Pages: 6571-6577
Number of pages: 7
Related papers
50 in total
  • [1] DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect
    Moussa, Hanane Nour
    Mourhir, Asmaa
    DATA IN BRIEF, 2023, 48
  • [2] Named Entity Recognition in Assamese: A Hybrid Approach
    Sharma, Padmaja
    Sharma, Utpal
    Kalita, Jugal
    2016 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2016, : 2114 - 2120
  • [3] Supervised Named Entity Recognition in Assamese language
    Talukdar, Gitimoni
    Borah, Pranjal Protim
    Baruah, Arup
    2014 INTERNATIONAL CONFERENCE ON CONTEMPORARY COMPUTING AND INFORMATICS (IC3I), 2014, : 187 - 191
  • [4] Named Entity Recognition In Assamese using CRFs and Rules
    Sharma, Padmaja
    Sharma, Utpal
    Kalita, Jugal
    PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2014), 2014, : 15 - 18
  • [5] A Named Entity Recognition Dataset for Turkish
    Kucuk, Dilek
    Kucuk, Dogan
    Arici, Nursal
    2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 329 - 332
  • [6] Named Entity Recognition for Partially Annotated Datasets
    Strobl, Michael
    Trabelsi, Amine
    Zaiane, Osmar
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS (NLDB 2022), 2022, 13286 : 299 - 306
  • [7] KazNERD: Kazakh Named Entity Recognition Dataset
    Yeshpanov, Rustem
    Khassanov, Yerbolat
    Varol, Huseyin Atakan
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 417 - 426
  • [8] DroNER: Dataset for drone named entity recognition
    Silalahi, Swardiantara
    Ahmad, Tohari
    Studiawan, Hudan
    DATA IN BRIEF, 2023, 48
  • [9] Creating a Dataset for Named Entity Recognition in the Archaeology Domain
    Brandsen, Alex
    Verberne, Suzan
    Wansleeben, Milco
    Lambers, Karsten
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 4573 - 4577
  • [10] SciCN: A Scientific Dataset for Chinese Named Entity Recognition
    Yang, Jing
    Ji, Bin
    Li, Shasha
    Ma, Jun
    Yu, Jie
    CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 78 (03): : 4303 - 4315