A web-based Bengali news corpus for named entity recognition

Cited by: 28
Authors
Ekbal, Asif [1]
Bandyopadhyay, Sivaji [1]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata 700032, India
Keywords
web as corpus; news corpus; web-based tagged Bengali news corpus; named entity; named entity recognition
DOI
10.1007/s10579-008-9064-x
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with or without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
Pages: 173-182
Page count: 10