Shahmukhi named entity recognition by using contextualized word embeddings

Cited by: 2
Authors
Tehseen, Amina [1]
Ehsan, Toqeer [2]
Bin Liaqat, Hannan [3]
Kong, Xiangjie [4]
Ali, Amjad [5]
Al-Fuqaha, Ala [5]
Affiliations
[1] Univ Gujrat, Dept Informat Technol, Gujrat 50700, Pakistan
[2] Univ Gujrat, Dept Comp Sci, Gujrat 50700, Pakistan
[3] Univ Educ, Dept Informat Technol, Div Sci & Technol, Township Campus, Lahore 54000, Pakistan
[4] Zhejiang Univ Technol, Coll Comp Sci & Technol, Hangzhou 310023, Peoples R China
[5] Hamad Bin Khalifa Univ, Coll Sci & Engn CSE, Informat & Comp Technol ICT Div, Doha, Qatar
Funding
National Natural Science Foundation of China;
Keywords
Shahmukhi; Punjabi; Named entity recognition; Neural networks; ELMo;
DOI
10.1016/j.eswa.2023.120489
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Named Entity Recognition (NER) is an important Natural Language Processing (NLP) task that aims to identify and classify predefined named entities in a given span of text. For many Western and Asian languages, NER is a well-studied and established task; however, little work has been done for Shahmukhi. This paper presents Shahmukhi NER with four key contributions. First, a Bidirectional Long Short-Term Memory (BiLSTM) network-based NER model has been developed by incorporating various features, including character and word embeddings and Part of Speech (POS) tagging. Second, transfer learning has been employed by training context-free Word2Vec and contextualized Embeddings from Language Models (ELMo) word representations. The word representations have been trained using a Shahmukhi corpus of 14.9 million words. Third, we prepared a cleaner version of an existing Shahmukhi NER corpus by performing Unicode normalization and tokenization. The corpus has been deduplicated, and results are reported on an unseen evaluation set, which yields valid results. Fourth, we have studied the impact of two annotation schemes, Inside-Outside (IO) and Inside-Outside-Beginning (IOB), for Shahmukhi. Transfer learning was helpful in enhancing the performance of the NER models; in particular, ELMo embeddings significantly improved the results by providing contextualized embedding vectors. This is the first study to use character embeddings, POS tagging and transfer learning for Shahmukhi named entity recognition. The IO scheme-based model achieved an accuracy of 98.60% with an F-score of 83.75. The IOB scheme-based model achieved an accuracy of 98.43% with an F-score of 75.55. These scores are quite promising for an under-resourced, morphologically rich language.
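As an illustration of the two annotation schemes named in the abstract, the following minimal sketch converts IOB tags to IO tags and counts entity spans under each. It is not the authors' code: the function names `iob_to_io` and `count_spans`, the `PER`/`LOC` label names, and the English example sentence are all assumptions chosen for readability.

```python
from typing import List


def iob_to_io(tags: List[str]) -> List[str]:
    """Collapse IOB tags to IO by rewriting every B- prefix as I-."""
    return ["I-" + t[2:] if t.startswith("B-") else t for t in tags]


def count_spans(tags: List[str]) -> int:
    """Count entity spans; a span starts at a B- tag or when the type changes."""
    count = 0
    prev_type = None
    for tag in tags:
        if tag == "O":
            prev_type = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "B" or entity_type != prev_type:
            count += 1
        prev_type = entity_type
    return count


# Hypothetical example: two person names appear back to back.
tokens = ["Amina", "Ali", "Hannan", "visited", "Lahore"]
iob = ["B-PER", "I-PER", "B-PER", "O", "B-LOC"]
io = iob_to_io(iob)

print(io)
print(count_spans(iob), count_spans(io))
```

In this toy sentence the IOB tags keep the two adjacent person names apart, while the IO tags merge them into one span. IO has fewer labels to learn but cannot mark the boundary between adjacent entities of the same type.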
Pages: 16
Related papers
50 in total
  • [1] Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings
    Zhai, Zenan
    Dat Quoc Nguyen
    Akhondi, Saber A.
    Thorne, Camilo
    Druckenbrodt, Christian
    Cohn, Trevor
    Gregory, Michelle
    Verspoor, Karin
    [J]. SIGBIOMED WORKSHOP ON BIOMEDICAL NATURAL LANGUAGE PROCESSING (BIONLP 2019), 2019, : 328 - 338
  • [2] Pooled Contextualized Embeddings for Named Entity Recognition
    Akbik, Alan
    Bergmann, Tanja
    Vollgraf, Roland
    [J]. 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 724 - 728
  • [3] Chemical Named Entity Recognition with Deep Contextualized Neural Embeddings
    Awan, Zainab
    Kahlke, Tim
    Ralph, Peter J.
    Kennedy, Paul J.
    [J]. KDIR: PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT - VOL 1: KDIR, 2019, : 135 - 144
  • [4] Named Entity Recognition Only from Word Embeddings
    Luo, Ying
    Zhao, Hai
    Zhan, Junlang
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 8995 - 9005
  • [5] Combining Word Embeddings for Portuguese Named Entity Recognition
    da Silva, Messias Gomes
    Alves de Oliveira, Hilario Tomaz
    [J]. COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2022, 2022, 13208 : 198 - 208
  • [6] Named Entity Recognition and Classification for Punjabi Shahmukhi
    Ahmad, Muhammad Tayyab
    Malik, Muhammad Kamran
    Shahzad, Khurram
    Aslam, Faisal
    Iqbal, Asif
    Nawaz, Zubair
    Bukhari, Faisal
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (04)
  • [7] Geographic Named Entity Recognition and Disambiguation in Mexican News using word embeddings
    Molina-Villegas, Alejandro
    Muniz-Sanchez, Victor
    Arreola-Trapala, Jean
    Alcantara, Filomeno
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2021, 176
  • [8] Improving Named Entity Recognition for Morphologically Rich Languages using Word Embeddings
    Demir, Hakan
    Ozgur, Arzucan
    [J]. 2014 13TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2014, : 117 - 122
  • [9] Combining Contextualized Embeddings and Prior Knowledge for Clinical Named Entity Recognition: Evaluation Study
    Jiang, Min
    Sanger, Todd
    Liu, Xiong
    [J]. JMIR MEDICAL INFORMATICS, 2019, 7 (04) : 80 - 94
  • [10] A deep neural framework for named entity recognition with boosted word embeddings
    Goyal, Archana
    Gupta, Vishal
    Kumar, Manish
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (06) : 15533 - 15546