A web-based Bengali news corpus for named entity recognition

Cited by: 28
Authors
Ekbal, Asif [1]
Bandyopadhyay, Sivaji [1]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata 700032, India
Keywords
web as corpus; news corpus; web-based tagged Bengali news corpus; named entity; named entity recognition
DOI
10.1007/s10579-008-9064-x
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
The rapid development of language resources and tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with or without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding the highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
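The abstract reports F-Score values per entity type without giving the formula. As a minimal sketch, assuming the balanced F-Score (the harmonic mean of precision and recall) computed over exact-match entity spans, the calculation could look like the following. The function names, the span representation, and the sample spans are hypothetical illustrations and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_token, end_token, entity_type).
Span = Tuple[int, int, str]


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 gives the balanced F-Score."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def evaluate(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Compute precision, recall, and balanced F-Score under exact matching.

    A predicted span counts as correct only if its boundaries and its
    entity type both match a gold span (an assumption of this sketch).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Made-up example data, for illustration only.
gold_spans = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted_spans = {(0, 2, "PER"), (5, 6, "ORG")}
precision, recall, f1 = evaluate(gold_spans, predicted_spans)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

To obtain per-type figures such as those in the abstract, the same function could be applied after filtering both sets to a single entity type.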
Pages: 173-182
Number of pages: 10
Related papers
50 in total
  • [31] Building the Classical Arabic Named Entity Recognition Corpus (CANERCorpus)
    Salah, Ramzi Esmail
    Zakaria, Lailatul Qadri Binti
    2018 FOURTH INTERNATIONAL CONFERENCE ON INFORMATION RETRIEVAL AND KNOWLEDGE MANAGEMENT (CAMP), 2018, : 150 - 157
  • [32] Building a Corpus-Derived Gazetteer for Named Entity Recognition
    Zamin, Norshuhani
    Oxley, Alan
    SOFTWARE ENGINEERING AND COMPUTER SYSTEMS, PT 2, 2011, 180 : 73 - 80
  • [33] GENETAG: a tagged corpus for gene/protein named entity recognition
    Lorraine Tanabe
    Natalie Xie
    Lynne H Thom
    Wayne Matten
    W John Wilbur
    BMC Bioinformatics, 6
  • [34] Named entity recognition through corpus transformation and system combination
    Troyano, JA
    Carrillo, V
    Enríquez, F
    Galán, FJ
    ADVANCES IN NATURAL LANGUAGE PROCESSING, 2004, 3230 : 255 - 266
  • [35] GENETAG: a tagged corpus for gene/protein named entity recognition
    Tanabe, L
    Xie, N
    Thom, LH
    Matten, W
    Wilbur, WJ
    BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
  • [36] An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition
    Hoxha, Klesti
    Baxhaku, Artur
    CYBERNETICS AND INFORMATION TECHNOLOGIES, 2018, 18 (01) : 95 - 108
  • [37] Corpus of Carbonate Platforms with Lexical Annotations for Named Entity Recognition
    Hu, Zhichen
    Ren, Huali
    Jiang, Jielin
    Cui, Yan
    Hu, Xiumian
    Xu, Xiaolong
    CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES, 2023, 135 (01) : 91 - 108
  • [38] Adaptive, multilingual named entity recognition in Web pages
    Petasis, G
    Karkaletsis, V
    Grover, C
    Hachey, B
    Pazienza, MT
    Vindigni, M
    Coch, J
    ECAI 2004: 16TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2004, 110 : 1073 - 1074
  • [39] An Adaptive Approach for Web Scale Named Entity Recognition
    Zhu, Jianhan
    2009 1ST IEEE SYMPOSIUM ON WEB SOCIETY, PROCEEDINGS, 2009, : 41 - 46
  • [40] ESpotter: Adaptive named entity recognition for web browsing
    Zhu, JH
    Uren, V
    Motta, E
    PROFESSIONAL KNOWLEDGE MANAGEMENT, 2005, 3782 : 518 - 529