A web-based Bengali news corpus for named entity recognition

Cited by: 28
Authors
Ekbal, Asif [1]
Bandyopadhyay, Sivaji [1]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata 700032, India
Keywords
web as corpus; news corpus; web-based tagged Bengali news corpus; named entity; named entity recognition
DOI
10.1007/s10579-008-9064-x
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding its highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
Pages: 173-182
Number of pages: 10