Deep Learning-based Sequence Labeling Tools for Nepali

Cited by: 1
Authors
Rai, Pooja [1, 2]
Chatterji, Sanjay [3]
Kim, Byung-Gyu [4]
Affiliations
[1] Indian Inst Informat Technol Kalyani, Kalyani 741235, West Bengal, India
[2] New Alipore Coll, Kolkata 700053, West Bengal, India
[3] Indian Inst Informat Technol Kalyani, Kalyani, West Bengal, India
[4] Sookmyung Womens Univ, Seoul 04310, South Korea
Keywords
Deep learning-based Nepali tools; Nepali sequence labeling tools; Nepali chunker; BI-LSTM-CRF neural network; Nepali text feature selection; Nepali optimum feature set
DOI
10.1145/3606696
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A Part-of-Speech (POS) tagger and a chunker (or shallow parser) are sequence labeling tools that are crucial for improving the accuracy of Natural Language Processing (NLP) tasks such as parsing, named entity recognition, sentiment analysis, and information extraction. Developing such tools for a low-resource language is an arduous task. Nepali is a relatively resource-poor Indian language and has seen little development from a computational perspective. Therefore, we present effective part-of-speech tagging and chunking tools for Nepali text using sequential deep learning models: a Bidirectional Long Short-Term Memory network with a Conditional Random Field layer (BI-LSTM-CRF) and other LSTM-based models, exploring both character and word embeddings of Nepali text. Word embeddings are used to capture syntactic as well as semantic information, whereas character embeddings are applied to capture the morphological and shape information of words and to handle the out-of-vocabulary problem. The developed chunker is the first statistical chunker for the Nepali language. A baseline model with a Conditional Random Field has also been developed to identify the optimum feature set for the aforementioned tasks. The BI-LSTM-CRF model achieved accuracies of 99.20% and 98.40% for Nepali POS tagging and chunking, respectively. These are the highest accuracies reported to date for Nepali. A thorough error analysis and observations are also reported with examples. The developed tools can help advance research in Nepali language processing, improve the accuracy of language technology applications, and contribute to the preservation and promotion of the Nepali language.
Pages: 23