An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus

Cited by: 4
Authors
Nguyen, Long H. B. [1]
Dien Dinh [1]
Phuoc Tran [2]
Affiliations
[1] Univ Sci, Fac Informat Technol, Hcm City, Vietnam
[2] Ton Duc Thang Univ, Fac Informat Technol, Hcm City, Vietnam
Keywords
Named entity translation; annotated bilingual corpus; English-Vietnamese; word alignment constraint
DOI
10.1145/2990191
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Manually constructing a Named Entity (NE) annotated bilingual corpus is time-consuming, labor-intensive, and expensive, yet such corpora are necessary for natural language processing (NLP) tasks such as cross-lingual information retrieval, cross-lingual information extraction, and machine translation. In this article, we present an automatic approach that constructs an NE-annotated English-Vietnamese bilingual corpus from a bilingual parallel corpus by means of an NE alignment method. Starting from a bilingual corpus in which the initial NEs are extracted separately on each language side, the approach tries to correct unrecognized or incorrectly recognized NEs before aligning them, using a variety of bilingual constraints. The generated corpus not only improves the NE recognition results but also provides alignments between English NEs and Vietnamese NEs, which are necessary for training NE translation models. The experimental results show that the approach outperforms the baseline methods. In the English-Vietnamese NE alignment task, the F-measure increases from 68.58% to 79.77%. The proposed method also significantly improves NE recognition quality: the F-measure rises from 84.85% to 88.66% on the English side and from 75.71% to 85.55% on the Vietnamese side. Providing the additional semantic information to machine translation systems increases the BLEU score from 33.04% to 45.11%.
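For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of how precision, recall, and F-measure are commonly computed over aligned NE pairs. It assumes the balanced (F1) variant and exact-match scoring of (English NE, Vietnamese NE) pairs; the abstract does not state which variant or matching criterion the authors used. The function name `alignment_f_measure` and the toy data are hypothetical and are not taken from the paper.

```python
def alignment_f_measure(predicted, gold):
    """Return (precision, recall, F1) over sets of aligned NE pairs.

    Assumption: a predicted pair counts as correct only if it exactly
    matches a gold pair. This is an illustrative scoring scheme, not
    necessarily the one used in the paper.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy, made-up alignments (English NE, Vietnamese NE); not data from the paper.
gold = {
    ("Ho Chi Minh City", "Thành phố Hồ Chí Minh"),
    ("Vietnam", "Việt Nam"),
    ("United Nations", "Liên Hợp Quốc"),
}
predicted = {
    ("Ho Chi Minh City", "Thành phố Hồ Chí Minh"),
    ("Vietnam", "Việt Nam"),
    ("United Nations", "Hoa Kỳ"),  # deliberately wrong alignment
}

precision, recall, f1 = alignment_f_measure(predicted, gold)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case two of the three predicted pairs match the gold alignments, so precision, recall, and F1 are all 2/3.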
Pages: 17