AsNER - Annotated Dataset and Baseline for Assamese Named Entity Recognition

Authors
Pathak, Dhrubajyoti [1]
Nandi, Sukumar [1]
Sarmah, Priyankoo [1]
Affiliation
[1] Indian Inst Technol Guwahati, North Guwahati, India
Keywords
NER dataset; Language Resources; Assamese NER; Assamese Language; Named Entity Recognition; NER model; AsNER
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens comprising text from the speech of the Prime Minister of India and an Assamese play. It also contains person names, location names and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing. We benchmark the dataset by training and evaluating NER models using state-of-the-art architectures and embeddings for supervised named entity recognition (NER), such as fastText, BERT, XLM-R, FLAIR and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.
Pages: 6571 - 6577
Number of pages: 7