AsNER - Annotated Dataset and Baseline for Assamese Named Entity Recognition

Authors
Pathak, Dhrubajyoti [1]
Nandi, Sukumar [1]
Sarmah, Priyankoo [1]
Affiliations
[1] Indian Inst Technol Guwahati, North Guwahati, India
Keywords
NER dataset; Language Resources; Assamese NER; Assamese Language; Named Entity Recognition; NER model; AsNER
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens of text from the speech of the Prime Minister of India and an Assamese play. It also contains person names, location names, and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing. We benchmark the dataset by training and evaluating NER models with state-of-the-art approaches for supervised named entity recognition (NER), such as FastText, BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.
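The F1-score quoted in the abstract is a standard NER metric. The sketch below shows one common way to compute such a score, as an illustration only: it assumes BIO-style tags and exact-match entity spans, neither of which is stated in this record, and the tag names `PER` and `LOC` and the helper functions are hypothetical. It is not the authors' evaluation script.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence of BIO tags (assumed scheme).

    An I- tag that does not continue an open span of the same type is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((label, start, i))
            start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(g), extract_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this definition a predicted entity counts as correct only if both its boundaries and its type match the gold annotation, which is stricter than token-level accuracy.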
Pages: 6571-6577
Page count: 7